Wednesday, August 3, 2016

Reading ASCII file in Python3.5 is 2-3x faster as bytes than string

I wrote an essay which compared the read performance in Python3.5 between bytes and the Unicode text options 'newline' and 'encoding'. I concluded that I couldn't get the Unicode string performance to within a factor of 2 of the binary byte performance, so chemfp will be working with bytes, not strings.
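As a rough sketch of the kind of comparison described above (not the benchmark from the essay), the following reads the same ASCII file once in binary mode and once in text mode with explicit 'encoding' and 'newline' options. The helper names, the sample data, and the file size are assumptions made for illustration, and the timings will vary by machine and Python version.

```python
import os
import tempfile
import time


def make_sample_file(n_lines=200000):
    """Write a temporary ASCII file (hypothetical sample data) and return its path."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "wb") as f:
        for i in range(n_lines):
            f.write(("CCO ethanol-%d\n" % i).encode("ascii"))
    return path


def count_lines_bytes(path):
    """Iterate over the file in binary mode; each line is a bytes object."""
    with open(path, "rb") as f:
        return sum(1 for _ in f)


def count_lines_text(path, encoding="ascii", newline="\n"):
    """Iterate over the file in text mode; each line is decoded to a str."""
    with open(path, "r", encoding=encoding, newline=newline) as f:
        return sum(1 for _ in f)


def time_call(func, *args):
    """Return (result, elapsed seconds) for a single call."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    path = make_sample_file()
    try:
        n_bytes, t_bytes = time_call(count_lines_bytes, path)
        n_text, t_text = time_call(count_lines_text, path)
        print("bytes: %d lines in %.3f s" % (n_bytes, t_bytes))
        print("text:  %d lines in %.3f s" % (n_text, t_text))
    finally:
        os.remove(path)
```

A single timed pass like this is noisy; repeating each measurement several times and taking the best time gives a fairer comparison.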

I also checked how the RDKit handles invalid Unicode, to see what another toolkit did for the same problem. I concluded that it uses bytes internally and exposes strings, which causes problems if those bytes cannot be converted to strings.
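The underlying problem can be shown in plain Python, without the RDKit. This is only a sketch of what goes wrong when bytes that are not valid UTF-8 must be exposed as a string; the function names and the sample record are made up for illustration and say nothing about how the RDKit itself is implemented.

```python
def decode_strict(raw, encoding="utf-8"):
    """Return the decoded string, or None if the bytes are not valid in that encoding."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


def decode_lenient(raw, encoding="utf-8"):
    """Decode, substituting U+FFFD for any bytes that cannot be decoded."""
    return raw.decode(encoding, errors="replace")


if __name__ == "__main__":
    # Hypothetical record name stored as Latin-1; the byte 0xE9 is not valid UTF-8 here.
    record = b"caf\xe9"
    print(decode_strict(record))         # strict decoding fails
    print(repr(decode_lenient(record)))  # lossy, but always produces a str
    print(record)                        # the bytes themselves are untouched
```

Keeping the data as bytes avoids having to choose between an exception and a lossy replacement.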

This is the place to leave comments about that post.

Sunday, July 6, 2014

The origin of the connection table

I've been trying to understand the origin of the connection table, and the origin of the term "connection table." The full details are in my essay "The origin of the connection table."

My investigations lead me to believe that Calvin Mooers in 1951 described the first practical connection table, which was used by many people in the 1950s and 1960s. In the mid-1990s, people started saying that George Wheland in 1949 was the first to describe the connection table. I investigated that earlier claim. Wheland's textbook does not describe a practical connection table for general purpose use, nor was the proposal suggested for use by a computer. Wheland brought it up to emphasize that nongeometrical representations were as descriptive as diagrams, but did not believe that the connection matrix was of practical use.
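For readers unfamiliar with the two representations, here is a minimal modern sketch of a connection table and a connection matrix for ethanol's heavy atoms. The layout and function names are my own illustration; this is not the format that Mooers or Wheland described.

```python
# Heavy atoms of ethanol (C-C-O), with bonds as (atom index, atom index, bond order).
ATOMS = ["C", "C", "O"]
BONDS = [(0, 1, 1), (1, 2, 1)]


def build_connection_table(atoms, bonds):
    """One entry per atom: its element and a list of (neighbor index, bond order)."""
    table = [{"element": element, "neighbors": []} for element in atoms]
    for a, b, order in bonds:
        table[a]["neighbors"].append((b, order))
        table[b]["neighbors"].append((a, order))
    return table


def build_connection_matrix(atoms, bonds):
    """Square symmetric matrix; entry [a][b] holds the bond order, or 0 if not bonded."""
    n = len(atoms)
    matrix = [[0] * n for _ in range(n)]
    for a, b, order in bonds:
        matrix[a][b] = order
        matrix[b][a] = order
    return matrix


if __name__ == "__main__":
    for index, row in enumerate(build_connection_table(ATOMS, BONDS)):
        print(index, row["element"], row["neighbors"])
    for row in build_connection_matrix(ATOMS, BONDS):
        print(row)
```

The table grows with the number of bonds, while the matrix grows with the square of the number of atoms, which is one reason the table form is the more practical of the two for large collections.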

I then tried to figure out why people say that Wheland is the creator of the connection table, but that's unresolved.

Finally, I tried to figure out when the term "connection table" was coined. It appears to be 1963, from people working at or affiliated with Chemical Abstracts, and perhaps due to influences from electrical engineering.

If you have comments about that essay, leave them here.

Wednesday, June 18, 2014

Calvin Mooers

I wrote an essay describing my current understanding of the ideas of Calvin Mooers regarding chemical database search, and how they affected the early decades of cheminformatics. This is necessarily incomplete as the details took place over 50 years ago and I don't have access to some of the critical records. If you have comments or questions, here's the place to voice them.

Saturday, July 27, 2013

Comments about 'Varkony Reconsidered'

I presented a talk titled "Varkony Reconsidered: Subgraph enumeration and the Multiple MCS problem" at the 6th Joint Sheffield Conference on Chemoinformatics on 23 July 2013.

The major goal of the presentation was to show how the 1979 paper of Varkony, which is part of the MCS literature, is better seen as an early paper in frequent subgraph mining. My own MCS algorithm, fmcs, is a variation of that approach, and is best seen as an intellectual descendant of an algorithm by Takahashi. Both the Varkony and Takahashi papers are relatively uncited in the MCS literature, so I spent some time explaining how they fit into the larger context.

The full MCS talk is on my web site. This is the place to leave comments.

Monday, May 21, 2012

Testing hard algorithms

Programming is hard. Various techniques help simplify programming, but sometimes the only way to implement something is to think a lot while writing a thousand lines of lightly-tested code, then put the code through a diverse set of tests until you're no longer worried that it's going to fail in unexpected ways. I wrote a long, reflective essay on testing hard algorithms, using my recent MCS work to provide structure.

Thursday, May 17, 2012

Topologically non-planar molecules

Chemists and mathematicians interpret "non-planar" differently. Very few small molecular graphs are non-planar, in the mathematical sense. Still, they do exist. In this essay, I managed to find some in the PubChem database.
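To give a feel for why mathematically non-planar molecular graphs are rare, here is a small sketch of a standard necessary condition that follows from Euler's formula: a simple connected planar graph with v >= 3 vertices has at most 3v - 6 edges. This is not the search method used in the essay, and the function name is made up for illustration. Passing the check does not prove planarity; failing it proves non-planarity.

```python
def passes_edge_bound(num_atoms, num_bonds):
    """Necessary (not sufficient) planarity condition for a simple connected graph.

    Returns False only when the graph has too many edges to be planar.
    """
    if num_atoms < 3:
        return True
    return num_bonds <= 3 * num_atoms - 6


if __name__ == "__main__":
    # Benzene ring: 6 atoms, 6 bonds. Far below the bound, like most molecules.
    print("benzene:", passes_edge_bound(6, 6))
    # K5 (5 vertices, 10 edges) exceeds the bound, so it cannot be planar.
    print("K5:", passes_edge_bound(5, 10))
    # K3,3 (6 vertices, 9 edges) passes the bound yet is still non-planar,
    # which shows the check is only a first screen.
    print("K3,3:", passes_edge_bound(6, 9))
```

Most molecules have roughly as many bonds as atoms, so they sit far below the bound; deciding the remaining cases needs a full planarity test.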

Saturday, May 12, 2012

Maximum Common Substructures and fmcs

Here's the place to comment about my posts related to my maximum common substructure algorithm, the fmcs tool based on the algorithm, and MCS benchmarking. The relevant essays are: