Pistachio
[Version 2024-10-31 (2024Q3)]
Reaction Data, Querying and Analytics
Pistachio is a reaction dataset and interface providing loading, querying, and analytics of chemical reactions. Pistachio builds on and extends existing solutions from NextMove Software to enrich reaction data and provide powerful query capabilities.
Figure 1. Pistachio Architecture
Reaction Data
Reaction data can be obtained from an ELN export (HazELNut), an external dataset (Reaxys), or mined from journals or patents. Patents provide a large, accessible collection of documents for mining and are therefore used for demonstration purposes. Data is mined from documents in three ways (Fig. 1). Patent Reaction Extraction uses LeadMine and ChemicalTagger to extract reactions and physical quantities from experimental paragraphs[1]. Indigo atom-mapping is then used to filter out suspect reactions; this step is a major bottleneck. Praline reads the ChemDraw CDX files supplied with U.S. patents, converting and interpreting exemplified reactions and schemes. LeadMine is used to create tables of bibliographic data (authors, document codes) and diseases (MeSH terms) from the title and claims sections. These datasets are merged into a JSON file with full reaction details and a denormalised table for indexing in PostgreSQL.
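To illustrate what "a JSON file with full reaction details and a denormalised table" could look like in practice, here is a minimal sketch. The record layout and every field name (`document`, `reaction_type`, `yield_percent`, `components`, `role`, `smiles`) are assumptions made up for illustration and are not Pistachio's actual schema. It shows only the general idea of flattening one nested reaction record into one row per component, repeating the reaction-level fields on each row.

```python
import json

# Hypothetical merged record. All field names and values are placeholders,
# not the real Pistachio JSON layout.
record_json = """
{
  "document": "DOC-1",
  "reaction_type": "Suzuki coupling",
  "yield_percent": 82.0,
  "components": [
    {"role": "reactant", "smiles": "Brc1ccccc1"},
    {"role": "reactant", "smiles": "OB(O)c1ccccc1"},
    {"role": "product", "smiles": "c1ccc(cc1)-c1ccccc1"}
  ]
}
"""


def denormalise(record):
    """Return one flat row per component, repeating the reaction-level fields."""
    shared = {key: value for key, value in record.items() if key != "components"}
    return [{**shared, **component} for component in record["components"]]


rows = denormalise(json.loads(record_json))
for row in rows:
    print(row)
```

Repeating the shared fields costs space but lets a single indexed table answer questions such as "all products of a given reaction type" without joins, which is the usual motivation for denormalising before indexing.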
Figure 2. Query Tagging
Query Handling
Pistachio queries are entered in an omnibox. The text is parsed using LeadMine to build an expression tree, which is then translated into a SQL query. The following basic data types are supported:
- Compound
  - SMILES
  - SMARTS
  - Trivial Name
  - Line Formula
  - Systematic Name
- SMARTS
- Reaction Type (NameRxn)
- Yield
- Affiliation (Assignee)
- Author (Inventor)
- Publication Date
- Document Name (Patent No.)
- Document Codes (IPC)
- Disease Terms
The compound types can be further constrained by component role (e.g. product) and search type (e.g. substructure, synthesis). Logical operators (AND, OR, NOT) can be used between terms, and terms can be grouped with parentheses. When no operator is given between two terms, AND is implied (Fig. 2).
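The path from query text to expression tree to SQL can be sketched as follows. This is an illustrative toy, not Pistachio's implementation: where Pistachio tags free text with LeadMine, the sketch substitutes a simple `field:value` syntax, and the field names, the `COLUMNS` mapping, and the column names are all hypothetical. It does show the behaviour described above: AND, OR, NOT, parenthesised grouping, and an implicit AND between adjacent terms. The `%s` placeholders assume a driver that uses that parameter style.

```python
import re

# A token is a parenthesis or a run of characters without spaces or parentheses.
TOKEN = re.compile(r"\(|\)|[^\s()]+")

# Hypothetical mapping from query fields to table columns.
COLUMNS = {"type": "reaction_type", "assignee": "assignee", "disease": "disease"}


def tokenize(text):
    return TOKEN.findall(text)


def parse(tokens):
    """Build a tree of nested tuples: TERM, NOT, AND, OR."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        pos += 1

    def parse_or():
        node = parse_and()
        while peek() == "OR":
            advance()
            node = ("OR", node, parse_and())
        return node

    def parse_and():
        node = parse_not()
        # Any token that can start a term continues the AND chain,
        # so the AND keyword itself is optional (implicit AND).
        while peek() is not None and peek() not in ("OR", ")"):
            if peek() == "AND":
                advance()
            node = ("AND", node, parse_not())
        return node

    def parse_not():
        if peek() == "NOT":
            advance()
            return ("NOT", parse_not())
        return parse_atom()

    def parse_atom():
        tok = peek()
        if tok is None or tok in (")", "AND", "OR"):
            raise ValueError(f"expected a term, got {tok!r}")
        advance()
        if tok == "(":
            node = parse_or()
            if peek() != ")":
                raise ValueError("missing closing parenthesis")
            advance()
            return node
        field, _, value = tok.partition(":")
        return ("TERM", field, value)

    tree = parse_or()
    if pos != len(tokens):
        raise ValueError(f"unexpected token {tokens[pos]!r}")
    return tree


def to_sql(node, params):
    """Render the tree as a WHERE clause, collecting values in params."""
    kind = node[0]
    if kind == "TERM":
        _, field, value = node
        params.append(value)
        return f"{COLUMNS[field]} = %s"
    if kind == "NOT":
        return f"NOT ({to_sql(node[1], params)})"
    left = to_sql(node[1], params)
    right = to_sql(node[2], params)
    return f"({left} {kind} {right})"


query = "type:Suzuki (assignee:Acme OR assignee:Globex) NOT disease:asthma"
tree = parse(tokenize(query))
params = []
where = to_sql(tree, params)
print(where)
print(params)
```

Keeping user-supplied values in `params` rather than splicing them into the SQL string, and looking column names up in a fixed mapping, means the query text never reaches the database as raw SQL.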
The following video demonstrates queries and results in real time.
See also
- John Mayfield et al., Pistachio, NIH Virtual Workshop on Reaction Informatics. May 2021
- 13,118,970 Reactions and Counting (Blog post)
- John Mayfield et al., Pistachio: Search and faceting of large reaction databases. ACS Fall 2017
- Daniel Lowe. Extraction of chemical structures and reactions from the literature. Ph.D. Thesis, 2012
- Postgres ltree extension
- John May and Roger Sayle. Substructure Search Face-off. Presented at CCNM on 27-May-2015