Tools To Check Out

This board is meant to be a place to kick around any interesting tools I come across, as well as any notes I take as I check them out.

SQL Mesh

What is it? Per their website:

SQLMesh is an open source data transformation framework that brings the best practices of DevOps to data teams. It enables data scientists, analysts, and engineers to efficiently run and deploy data transformations written in SQL or Python. It is created and maintained by Tobiko Data, a company founded by data leaders from Airbnb, Apple, and Netflix.

Why am I interested?

It seems like an interesting alternative to a tool like Synapse or Fabric.
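
I haven't run it yet, but from their docs a Python model looks roughly like this (the model name and columns here are made up, just to show the shape of the API):

```python
import typing as t
from datetime import datetime

import pandas as pd
from sqlmesh import ExecutionContext, model


# Hypothetical model name and columns for illustration only
@model(
    "demo.daily_orders",
    columns={"order_id": "int", "order_date": "date"},
)
def execute(
    context: ExecutionContext,
    start: datetime,
    end: datetime,
    execution_time: datetime,
    **kwargs: t.Any,
) -> pd.DataFrame:
    # A real model would query upstream tables via the context;
    # returning a static frame keeps the sketch self-contained
    return pd.DataFrame([{"order_id": 1, "order_date": start.date()}])
```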

Plot Nine

A grammar-of-graphics plotting library for Python, modeled on R's ggplot2.
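
A minimal example of the ggplot-style API (the data and column names are my own):

```python
import pandas as pd
from plotnine import aes, geom_point, ggplot

df = pd.DataFrame({"x": [1, 2, 3], "y": [2, 4, 6]})

# Build the plot by adding layers, ggplot2-style
plot = ggplot(df, aes(x="x", y="y")) + geom_point()
plot.save("points.png")
```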

Satyrn

Satyrn is a Mac-based alternative to Jupyter. Some nice features I've found so far are:

  • It is a really clean interface with few distractions
  • They use most of the same keyboard shortcuts as Jupyter
  • They list their keyboard shortcuts right away in an intro notebook
  • The auto-complete is really fast, and the tool itself seems pretty quick too

Some things that have been a bit of a challenge:

  • I ended up setting up a venv and then grabbing the path to its bin to get a kernel with some packages (wouldn't mind having a default environment with the option to pip install); a sketch of what I did is after this list
  • Not entirely sure how to use Black (I set up the path, but didn't notice a keyboard shortcut)
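
For reference, roughly what I did to get a kernel (untested as written; the env name and packages are placeholders, and the bin/ layout assumes macOS/Linux):

```python
import subprocess
import sys
from pathlib import Path

env_dir = Path("satyrn-env")  # placeholder location

# Create the venv and install the packages I wanted in the kernel
subprocess.run([sys.executable, "-m", "venv", str(env_dir)], check=True)
subprocess.run([str(env_dir / "bin" / "pip"), "install", "pandas", "black"], check=True)

# This is the interpreter path to hand to Satyrn as the kernel
print(env_dir / "bin" / "python")
```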

Things to Try

  • Get a ChatGPT API key and try the built-in help

Tauri

Like Electron, but with a Rust backend.

Marimo

A reactive notebook option for Python.
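
From what I've seen of the format, a marimo notebook is a plain Python file where each cell declares what it reads and returns, and dependent cells re-run automatically when upstream values change; a rough sketch:

```python
import marimo

app = marimo.App()


@app.cell
def _():
    x = 1
    return (x,)


@app.cell
def _(x):
    # Re-runs automatically whenever the cell defining x changes
    y = x + 1
    print(y)
    return (y,)


if __name__ == "__main__":
    app.run()
```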

Supabase

A Postgres database service with auth.
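
A quick sketch with their Python client (the project URL, key, and table name are placeholders):

```python
from supabase import create_client

# Placeholder project URL and anon key
supabase = create_client("https://xyzcompany.supabase.co", "public-anon-key")

# Query a hypothetical "todos" table
rows = supabase.table("todos").select("*").execute()
print(rows.data)
```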

Great Expectations

Data quality validation tool.
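
The API has shifted a lot between releases, so take this as a minimal sketch in the older pandas-dataset style (the frame and column are made up):

```python
import great_expectations as ge
import pandas as pd

df = pd.DataFrame({"id": [1, 2, None]})

# Wrap the frame so expectation methods are available on it
gdf = ge.from_pandas(df)
result = gdf.expect_column_values_to_not_be_null("id")
print(result.success)  # False here, since one id is null
```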

Kolo

Invert a trace and get a working integration test in fifteen minutes.

Difftastic

Difftastic is a CLI diff tool that compares files based on their syntax rather than line-by-line, producing accurate diffs that are easier for humans to read.

Quarto

An open-source scientific and technical publishing system

SQL Flow

SQL visualization tool

FastHTML

Modern web applications in pure Python
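
The hello world from their docs is about this small:

```python
from fasthtml.common import *

# fast_app() returns the app plus a route decorator
app, rt = fast_app()


@rt("/")
def get():
    return Titled("Hello", P("Hello from FastHTML"))


serve()
```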

LanceDB

LanceDB is an open-source vector database for AI that's designed to store, manage, query and retrieve embeddings on large-scale multi-modal data. The core of LanceDB is written in Rust 🦀 and is built on top of Lance, an open-source columnar data format designed for performant ML workloads and fast random access.
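
A quick local sketch (the path, table name, and tiny 2-d vectors are placeholders; real embeddings would come from a model):

```python
import lancedb

db = lancedb.connect("./lancedb-demo")  # placeholder local path

table = db.create_table(
    "items",
    data=[
        {"vector": [0.1, 0.2], "item": "apple"},
        {"vector": [0.9, 0.8], "item": "banana"},
    ],
)

# Nearest-neighbour search over the stored vectors
print(table.search([0.1, 0.2]).limit(1).to_list())
```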

Sanity RSS Plugin

An RSS feed plugin for the Sanity CMS.

Datasette

Datasette is a tool for exploring and publishing data. It helps people take data of any shape, analyze and explore it, and publish it as an interactive website and accompanying API.

SeaweedFS

Potential local S3 option.
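
If I try it as a local S3 stand-in, I'd expect to point boto3 at its S3 gateway (the port and credentials below are assumptions on my part):

```python
import boto3

# Assumes a local SeaweedFS S3 gateway is running;
# 8333 is its default port per the docs
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:8333",
    aws_access_key_id="any",
    aws_secret_access_key="any",
)

s3.create_bucket(Bucket="scratch")
s3.put_object(Bucket="scratch", Key="hello.txt", Body=b"hi")
print(s3.get_object(Bucket="scratch", Key="hello.txt")["Body"].read())
```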

Deltabase

Polars + Delta Lake.
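
I haven't looked at Deltabase's own API yet; for context, plain Polars can already round-trip Delta tables (this is Polars, not Deltabase, and needs the deltalake package installed):

```python
import polars as pl

df = pl.DataFrame({"id": [1, 2], "name": ["a", "b"]})

# Requires the `deltalake` package under the hood
df.write_delta("./delta-demo")
print(pl.read_delta("./delta-demo"))
```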

Stumpy

Time series analysis in Python, built around matrix profiles.
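
The core call computes a matrix profile over a sliding window (synthetic data and an arbitrary window size here):

```python
import numpy as np
import stumpy

# Synthetic series; a real one would come from measurements
rng = np.random.default_rng(0)
ts = np.sin(np.linspace(0, 20 * np.pi, 2000)) + rng.normal(0, 0.1, 2000)

# Matrix profile with a window of 50 points; column 0 holds the
# distance of each subsequence to its nearest neighbour
mp = stumpy.stump(ts, m=50)
print(mp[:, 0].argmin())  # index of the most self-similar subsequence (a motif)
```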

LLMIO

llm io

SQLGlot

SQLGlot is a no-dependency SQL parser, transpiler, optimizer, and engine. It can be used to format SQL or translate between 21 different dialects like DuckDB, Presto / Trino, Spark / Databricks, Snowflake, and BigQuery. It aims to read a wide variety of SQL inputs and output syntactically and semantically correct SQL in the targeted dialects.
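
The one-liner from their README gives the flavor:

```python
import sqlglot

# Translate DuckDB's EPOCH_MS call into Hive/Spark syntax
print(sqlglot.transpile("SELECT EPOCH_MS(1618088028295)", read="duckdb", write="hive")[0])
```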

FireDucks

A compiler-accelerated DataFrame library for Python with a fully pandas-compatible API
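
Per their docs, adoption is meant to be just an import swap:

```python
# Swap the usual `import pandas as pd` for the FireDucks drop-in
import fireducks.pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})
print(df["a"].sum())  # same pandas API, compiler-accelerated underneath
```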

Pyper

Concurrent Python made simple

SQL Flow

DuckDB for streaming data

SparkDQ

Most data quality frameworks weren't designed with PySpark in mind. They aren't Spark-native and often lack proper support for declarative pipelines. Instead of integrating seamlessly, they require you to build custom wrappers around them just to fit into production workflows. This adds complexity and makes your pipelines harder to maintain. On top of that, many frameworks only validate data after processing, so you can't react dynamically or fail early when data issues occur.

Patito

Patito offers a simple way to declare pydantic data models which double as schemas for your polars data frames. These schemas can be used for:

  • 👮 Simple and performant data frame validation.
  • 🧪 Easy generation of valid mock data frames for tests.
  • 🐍 Retrieve and represent singular rows in an object-oriented manner.
  • 🧠 Provide a single source of truth for the core data models in your code base.
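
A minimal validation sketch (the model and frame are made up):

```python
import patito as pt
import polars as pl


class Product(pt.Model):
    product_id: int = pt.Field(unique=True)
    name: str


df = pl.DataFrame({"product_id": [1, 2], "name": ["apple", "banana"]})

# Raises a validation error if the frame doesn't match the model
Product.validate(df)
```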

Dataframely

Dataframely is a Python package to validate the schema and content of polars data frames. Its purpose is to make data pipelines more robust by ensuring that data meet expectations and more readable by adding schema information to data frame type hints.
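
A sketch along the lines of their README example (the schema and column rules are illustrative):

```python
import dataframely as dy
import polars as pl


class HouseSchema(dy.Schema):
    zip_code = dy.String(nullable=False)
    num_bedrooms = dy.UInt8(nullable=False)


df = pl.DataFrame({"zip_code": ["01234"], "num_bedrooms": [2]})

# Casts and validates; raises if the frame violates the schema
validated = HouseSchema.validate(df, cast=True)
print(validated)
```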

DuckLake

TL;DR: DuckLake simplifies lakehouses by using a standard SQL database for all metadata, instead of complex file-based systems, while still storing data in open formats like Parquet. This makes it more reliable, faster, and easier to manage.
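
A sketch of trying it from DuckDB's Python client (the catalog file and data path are placeholders):

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL ducklake;")
con.execute("LOAD ducklake;")

# Metadata lives in a SQL catalog; table data lands as Parquet under DATA_PATH
con.execute("ATTACH 'ducklake:metadata.ducklake' AS lake (DATA_PATH 'lake_files/');")
con.execute("CREATE TABLE lake.demo AS SELECT 42 AS answer;")
print(con.execute("SELECT * FROM lake.demo").fetchall())
```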
