Going Polyglot: Interoperability

In this series, I will journal my efforts to transpile the JSON data schemas for commands in our TypeScript system into Pydantic data classes. This will allow our team to regain static type safety while tapping the best the Python ecosystem has to offer for text and media file manipulation in bulk data ingestion pipelines.

  1. A command line tool to write the schemas (COMING SOON)
  2. Reading the JSON schemas into Python
  3. Understanding Abstract Syntax Trees in Python
  4. Generating ASTs for Pydantic data class definitions from the schemas
  5. Generating the code from an AST
  6. A complete pipeline

Feel free to jump into part 1. If you're interested in the backstory, read on.

Background

For the past decade, I've lived a double life. I've spent half my time in the world of Domain Driven Design, implementing full-stack solutions to enterprise problems. I've spent the rest of my time analyzing data, training Machine Learning models, creating Natural Language Processing pipelines, and, in general, wrangling data. I am a big believer in using the right tool for the job, and I have found TypeScript to be a good fit for full-stack applications, but Python (and especially its ecosystem) to be the right choice for big data tasks.

At some point, my worlds collided on a single project. While working on the Web of Knowledge platform, we realized that many user groups had extensive data that they wanted to import. Further, each tenant had unique needs for post-processing of the data. Long story short, we reached the point where formalizing some ad-hoc Python scripts and notebooks into proper tooling was a great business investment.

How did we get here?

Our system exposes the ability to update the application state via a command system. Consistent with Domain Driven Design and CQRS, each command is a request to update a single aggregate root (the root entity that controls state changes for all nested entities). This allows us to create workflows that express changes in the ubiquitous language. Further, whenever a command succeeds, we get one or more events. The events serve two purposes (a rough sketch of this command-to-event flow follows the list below).

  1. They provide a fine-grained, auditable log of how we got here, one that can be replayed at any time up to any point in the history.
  2. By publishing the events on a messaging queue (or event bus in early stages of the architecture), we can introduce as many decoupled event consumers as we'd like. We have a set of consumers that maintain materialized views to communicate with web and mobile clients, and these are easy to scale horizontally. Full-text search via custom indexes becomes extremely simple. We can have additional views to eagerly perform feature extraction and compile datasets for training custom machine learning models on our data.
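To make the shape of this flow concrete, here is a minimal sketch in Python. The real commands, events, and handlers live in our TypeScript domain layer; every name below is hypothetical and invented purely for illustration.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class CreateTerm:
    # A command targets exactly one aggregate root, identified here by aggregate_id.
    aggregate_id: str
    text: str
    language_code: str

@dataclass
class TermCreated:
    # A successful command yields one or more events like this one.
    aggregate_id: str
    payload: dict

def handle_create_term(command: CreateTerm) -> List[TermCreated]:
    # The resulting events are published so that decoupled consumers
    # (materialized views, search indexes, ML feature extraction) can react.
    return [
        TermCreated(
            aggregate_id=command.aggregate_id,
            payload={"text": command.text, "languageCode": command.language_code},
        )
    ]
```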

While we expose a command endpoint over a REST API, it was important that our service and domain logic not leak into the HTTP Adapter.

Command Line Interface

We needed to import and export data, so we developed a CLI for the purpose. We soon realized that, because our service layer was decoupled from the REST API, we could easily write CLI bindings for commands (or queries, for that matter). In fact, we can even fire the system up and drive the domain from a REPL!

At this point, we added an execute-command-stream command. We could either pipe in a serialized JSON array of commands or point the command at a JSON file holding the same data. At first, this was useful for seeding staging instances with demonstration data.
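For example, a seed file for a staging instance might look something like the following, written here from Python for consistency with the rest of this series. The command types and payload fields are invented for illustration and are not our real schemas.

```python
import json

# Hypothetical demonstration data: a serialized array of commands to be piped
# into the CLI or stored in a JSON file for later use.
demo_commands = [
    {
        "type": "CREATE_TERM",
        "payload": {"id": "term-001", "text": "hello", "languageCode": "en"},
    },
    {
        "type": "PUBLISH_TERM",
        "payload": {"id": "term-001"},
    },
]

with open("demo-command-stream.json", "w") as f:
    json.dump(demo_commands, f, indent=2)
```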

Over time, we started to accumulate disjoint, untracked, ad-hoc tooling amongst ourselves. When the need to ingest data for production came up, we knew it was time to bring this tooling in line with the standards for our domain code.

I myself started with some proof-of-concept notebooks in Python. I wrote a Python wrapper around the CLI so that I could execute commands from Python, which I did from Jupyter notebooks. The biggest pain point was the sudden loss of static type safety and the awkwardness of building up commands as Data Transfer Objects from within Python. Still, I felt it was a fine tradeoff, because tasks like parsing a CSV into a pandas DataFrame for further preprocessing, or parsing a .docx file, are routine in Python. In the JavaScript world, I often have to choose between ten poor options and end up writing my own libraries for such tasks (because I can).
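The wrapper itself was little more than a thin shell around the CLI. Here is a minimal sketch of the idea; the binary name ("coscrad-cli") and the exact invocation are assumptions for illustration, and only the general approach of piping a serialized JSON array of commands into the CLI comes from our real workflow.

```python
import json
import subprocess

def execute_command_stream(commands: list[dict], cli: str = "coscrad-cli") -> str:
    """Serialize a list of command DTOs and pipe them into the CLI."""
    result = subprocess.run(
        [cli, "execute-command-stream"],
        input=json.dumps(commands),
        capture_output=True,
        text=True,
        check=True,  # raise if the CLI exits with a non-zero status
    )
    return result.stdout
```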

I was familiar with FastAPI, having used it to wrap an API around a custom-built language model using spaCy for a separate project. I was quite impressed with Pydantic and struck by its similarity to our TypeScript solution for managing schemas for DTOs, Events, Commands, etc. We implemented a custom set of decorators and data types, along with validation tooling, that allows us to validate schemas at the boundaries of our system. This leaves our multi-paradigm document store / graph database to focus on simple persistence, and further leaves all "invariant validation" completely database agnostic. At the end of the day, for each domain model, command, event, or view model within our system, we can use reflection to get a custom COSCRAD Data Schema as a plain old JavaScript object, or write it to JSON.

It seems natural to me to translate command schemas into Pydantic class definitions, and thereby recover type inference in our data ingestion pipelines. I decided to pursue this task on my own time as a learning exercise that could have very practical payoff.
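To make the goal concrete, here is the kind of translation I have in mind. The input below is a hypothetical, heavily simplified stand-in for a command schema (the real COSCRAD Data Schemas have their own property names and data types), and the Pydantic class is the sort of artifact the pipeline should generate.

```python
from typing import Optional

from pydantic import BaseModel

# A hypothetical, simplified command schema; the property names here are
# invented for illustration only.
command_schema = {
    "type": "CREATE_TERM",
    "properties": {
        "id": {"dataType": "UUID", "isOptional": False},
        "text": {"dataType": "NON_EMPTY_STRING", "isOptional": False},
        "languageCode": {"dataType": "NON_EMPTY_STRING", "isOptional": True},
    },
}

# The target of the pipeline: a generated Pydantic model that recovers static
# type information and runtime validation on the Python side.
class CreateTerm(BaseModel):
    id: str
    text: str
    languageCode: Optional[str] = None
```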

Note that translating our event schemas to Python or even Scala type classes is a project on the near horizon!