CaffeineFix

Version 2.0 [201211]

Chemical Nomenclature Spelling Correction

The complex technical nomenclature used by chemists to describe molecular structures presents unique challenges to regular natural language processing (NLP) tools. Software for handling English text often can't handle the non-standard use of whitespace, hyphenation, punctuation, Greek characters, italics and even superscripts found in chemical names. Likewise, the unusual letter combinations that occur in IUPAC, Chemical Abstracts, Beilstein and traditional names can trip up the trigram analysis frequently used in spell checking software.

CaffeineFix overcomes the limitations of existing solutions by using novel algorithms designed specifically for IUPAC-like organic chemistry nomenclature. Unlike dictionary-based approaches, CaffeineFix's "push-down automaton" technology allows it to check and correct against an infinite number of words and chemical names. "Levenshtein distance" can be used to identify corrections and automatically correct unambiguous errors. User-parameterizable substitution matrices and insertion-deletion (indel) penalties can be used to customize suggestion scores for a particular end-user application. For example, when working with OCR-scanned text the cost of substituting "rn" with "m", or "l" with "1", can be reduced, and the relative cost of deleting hyphens can be tweaked when processing paginated text.

12-dichlorobenzne? Did you mean 1,2-dichlorobenzene?
didec-2-ene? Did you mean dodec-2-ene?
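
The scoring idea behind such suggestions can be illustrated with a minimal Python sketch of a weighted Levenshtein distance. This is not CaffeineFix's implementation or API: the function names (weighted_edit_distance, suggest), the cost tables and the fixed candidate list are all assumptions made for illustration. A real system would match against a grammar rather than a short list, and multi-character substitutions such as "rn" with "m" would need an extension that is not shown here.

```python
def weighted_edit_distance(source, target, sub_cost=None, ins_cost=None, del_cost=None):
    """Cost of turning `source` into `target` with per-character penalties.

    Hypothetical parameters (not CaffeineFix's API):
      sub_cost: {(from_char, to_char): cost}, default 1.0 per substitution
      ins_cost: {char: cost}, default 1.0 per inserted character
      del_cost: {char: cost}, default 1.0 per deleted character
    """
    sub_cost = sub_cost or {}
    ins_cost = ins_cost or {}
    del_cost = del_cost or {}
    rows, cols = len(source) + 1, len(target) + 1
    dist = [[0.0] * cols for _ in range(rows)]
    # First column: delete every character of the source prefix.
    for i in range(1, rows):
        dist[i][0] = dist[i - 1][0] + del_cost.get(source[i - 1], 1.0)
    # First row: insert every character of the target prefix.
    for j in range(1, cols):
        dist[0][j] = dist[0][j - 1] + ins_cost.get(target[j - 1], 1.0)
    for i in range(1, rows):
        for j in range(1, cols):
            a, b = source[i - 1], target[j - 1]
            substitution = 0.0 if a == b else sub_cost.get((a, b), 1.0)
            dist[i][j] = min(
                dist[i - 1][j] + del_cost.get(a, 1.0),   # delete a
                dist[i][j - 1] + ins_cost.get(b, 1.0),   # insert b
                dist[i - 1][j - 1] + substitution,       # match or substitute
            )
    return dist[-1][-1]


def suggest(word, candidates, **costs):
    """Return (best_candidate, score) from a fixed, illustrative candidate list."""
    scored = sorted((weighted_edit_distance(word, c, **costs), c) for c in candidates)
    best_score, best = scored[0]
    return best, best_score


# Illustrative candidate list; a grammar-based checker would not need one.
NAMES = ["1,2-dichlorobenzene", "dodec-2-ene"]

print(suggest("12-dichlorobenzne", NAMES))  # two insertions: "," and "e"
print(suggest("didec-2-ene", NAMES))        # one substitution: "i" -> "o"

# OCR-style tuning: make confusing "l" with "1" cheap.
ocr_costs = {"sub_cost": {("l", "1"): 0.2}}
print(suggest("l,2-dichlorobenzene", NAMES, **ocr_costs))
```

With the default unit costs, the two misspellings above score 2.0 and 1.0 against their intended names. Lowering a single entry in the substitution table, or the deletion penalty for "-", changes the ranking of suggestions without touching the algorithm, which is the kind of per-application tuning described above.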

Correct typos in documents within word processors, spreadsheets, presentation and other office software.
Enhance text-based chemical database searches of registration systems or chemical supplier catalogues, with Google-like "did you mean ...?" functionality.
Fix OCR errors in scanned documents.
Improve the performance of name-to-structure conversion in text-mining and chemical entity extraction applications.

Arthor provides fast state-of-the-art substructure and chemical similarity search capabilities for ultra-large databases of hundreds of millions of compounds, using SMARTS optimization, Just-In-Time compilation and/or GPUs.

CaffeineFix is used to rapidly match chemical names or terms against a dictionary or grammar (e.g. a grammar for IUPAC names). As well as use in text-mining, it can be used to provide autocomplete functionality and spell-correction.

Casandra is a server for delivering real-time safety warnings of experimental hazards straight to pharmaceutical electronic laboratory notebooks (ELNs).

HazELNut is a suite of tools used to extract, normalize and analyse information in Electronic Lab Notebooks (ELNs). This can be used to implement a search interface, find/eliminate duplicates, find similar reactions and so on.

LeadMine extracts chemical names and terms from text. It incorporates NextMove's CaffeineFix technology to find terms that match appropriate dictionaries or grammars. It has enhanced functionality to handle the patent literature.

Matsy is a set of tools for creating and analysing Matched Molecular Series (the general form of Matched Molecular Pairs). In particular, it can be used to suggest what compound to make next in a Medicinal Chemistry program.

MPSearch rapidly searches a database to find Matched Pairs related to a query molecule. This type of search is used to explore previous medicinal chemistry strategies.

NameRXN is used to classify and name reactions. It is particularly useful in the context of ELN analysis but also as a plugin to chemical drawing software. NameRXN builds on NextMove Software's Patsy technology.

Patsy is used to speed up SMARTS pattern matching by creating optimized SMARTS patterns or source code. Speed gains are particularly large when multiple SMARTS patterns are matched against a single structure.

Pistachio is a reaction dataset browser providing loading, querying, and analytics of chemical reactions. With over 21 million chemical reactions extracted from US and EPO patents, it demonstrates an AI interface to faceted (structure) search.

SmallWorld is an index of chemical space based on more than 230 billion molecular substructures. It can be used to measure similarity based on graph-edit distance, find the MCS of two or more molecules, analyse HTS results and much more.

Sugar & Splice can be used to perceive and depict biopolymer structure. It makes it easy to interconvert between small-molecule representations (e.g. SMILES, MOL) and biopolymer representations (HELM, IUPAC line notation).

General Inquiries: info@nextmovesoftware.com Support: support@nextmovesoftware.com
