This board is meant to be a place to park any interesting tools I come across, as well as any notes I take on each tool as I check it out.
SQL Mesh
What is it? Per their website:
SQLMesh is an open source data transformation framework that brings the best practices of DevOps to data teams. It enables data scientists, analysts, and engineers to efficiently run and deploy data transformations written in SQL or Python. It is created and maintained by Tobiko Data, a company founded by data leaders from Airbnb, Apple, and Netflix.
Why am I interested?
It seems like an interesting alternative to a tool like Synapse or Fabric.
Satyrn
Satyrn is a Mac-based alternative to Jupyter. Some nice features I've found so far are:
- It is a really clean interface with few distractions
- They use most of the same keyboard shortcuts as Jupyter
- They list their keyboard shortcuts right away in an intro notebook
- The auto-complete is really fast, and the app itself feels snappy too.
Some things that have been a bit of a challenge:
- I ended up setting up a venv and then pointing Satyrn at the venv's bin path to get a kernel with some packages (I wouldn't mind having a default environment with the option to pip install)
- Not entirely sure how to use black (I set the path to it, but didn't spot a keyboard shortcut).
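The venv workaround above can be sketched roughly as follows (the directory name and package list are just examples):

```shell
# Create an isolated environment to serve as the notebook kernel
python3 -m venv .venv

# Install whatever packages the notebooks need, e.g.:
#   .venv/bin/pip install ipykernel polars

# This interpreter path is what to point the new kernel at
echo "$PWD/.venv/bin/python"
```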
Things to Try
- Get an API key for ChatGPT and try the built-in help
Tauri
An Electron alternative with a Rust backend.
Marimo
Reactive notebook option.
Supabase
A Postgres database service with auth.
Kolo
Invert a trace and get a working integration test in fifteen minutes.
Difftastic
Difftastic is a CLI diff tool that compares files based on their syntax, not line-by-line. Difftastic produces accurate diffs that are easier for humans to read.
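One quick way to try it out (a sketch, assuming the binary is installed under its documented name `difft`) is to register it as git's external diff tool:

```shell
# Tell git to shell out to Difftastic whenever `git diff` runs
git config --global diff.external difft

# Confirm the setting
git config --get diff.external
```

Undoing it later is `git config --global --unset diff.external`.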
Quarto
An open-source scientific and technical publishing system
FastHTML
Modern web applications in pure Python
LanceDB
LanceDB is an open-source vector database for AI that's designed to store, manage, query and retrieve embeddings on large-scale multi-modal data. The core of LanceDB is written in Rust 🦀 and is built on top of Lance, an open-source columnar data format designed for performant ML workloads and fast random access.
Datasette
Datasette is a tool for exploring and publishing data. It helps people take data of any shape, analyze and explore it, and publish it as an interactive website and accompanying API.
Fire Ducks
Compiler Accelerated DataFrame Library for Python with fully-compatible pandas API
Pyper
Concurrent Python made simple
SparkDQ
Most data quality frameworks weren't designed with PySpark in mind. They aren't Spark-native and often lack proper support for declarative pipelines. Instead of integrating seamlessly, they require you to build custom wrappers around them just to fit into production workflows. This adds complexity and makes your pipelines harder to maintain. On top of that, many frameworks only validate data after processing, so you can't react dynamically or fail early when data issues occur.
Patito
Patito offers a simple way to declare pydantic data models which double as schemas for your polars data frames. These schemas can be used for:
- Simple and performant data frame validation.
- Easy generation of valid mock data frames for tests.
- Retrieving and representing singular rows in an object-oriented manner.
- Providing a single source of truth for the core data models in your code base.
Dataframely
Dataframely is a Python package to validate the schema and content of polars data frames. Its purpose is to make data pipelines more robust by ensuring that data meet expectations and more readable by adding schema information to data frame type hints.
DuckLake
TL;DR: DuckLake simplifies lakehouses by using a standard SQL database for all metadata, instead of complex file-based systems, while still storing data in open formats like Parquet. This makes it more reliable, faster, and easier to manage.