
Repository brief

apache/spark


Cached analysis

cached 2026-03-30T15:56:02.063Z



Apache Spark is an active Apache project for large-scale data processing and unified analytics. It supports Scala, Java, Python, and R (deprecated) and includes Spark SQL, pandas API on Spark, MLlib, GraphX, and Structured Streaming. The repo is mature and heavily forked, with 29,139 forks and 43,059 stars, and it was last pushed on 2026-03-30.


Stars: 43,059

Forks: 29,139

Default branch: master

Last pushed: 2026-03-30T14:00:27Z


Forks


10 of 10 fork briefs


Choose this fork if you want Palantir-specific Spark behavior and can live with an older, highly diverged codebase. Choose upstream Spark if you want current features, easier upgrades, and the broadest community support.

Prefer upstream Spark unless you specifically need this fork's legacy StreamSQL/Kafka streaming extensions and are willing to maintain a heavily outdated, highly divergent codebase yourself.

Choose this fork only if you need its legacy 1.1.x behavior or custom integrations. For most adopters, upstream Apache Spark is the better choice because this fork is stale, highly divergent, and missing modern Spark capabilities.

Choose this fork only if GPU acceleration is the primary requirement and you can absorb the maintenance burden. For most users, upstream Spark is the safer default because this fork is stale and materially behind.

Prefer this fork only if you need its older Hive/Spark compatibility and are willing to maintain a heavily lagging Spark branch. For most adopters, upstream Apache Spark is the safer choice because this fork is stale and likely missing many newer APIs, fixes, and usability improvements.

Choose this fork only if you need an old, historical Spark baseline. For active development, production use, or modern Spark features, upstream is the better choice by a wide margin.

Prefer this fork only if you need an old, frozen Spark baseline. If you want current Spark features, compatibility, or ongoing maintenance, upstream is the better choice by a wide margin.

Choose this fork only if you need legacy MapR integration and can accept an old Spark baseline. For anyone starting fresh or wanting current Spark features, upstream Apache Spark is the better fit.

Prefer this fork only if AWS Fargate serverless deployment is the primary requirement and you can accept a frozen, highly divergent Spark codebase. If you need current Spark features, compatibility, or active upstream support, upstream Apache Spark is the safer choice.

Prefer this fork only if you need its legacy compatibility and custom patches and can accept a large gap from active Apache Spark development. If you want current Spark features, fixes, and ecosystem compatibility, upstream is the better choice.

Fork comparison

palantir/spark

38/100

stale

significant_divergence

Choose this fork if you want Palantir-specific Spark behavior and can live with an older, highly diverged codebase. Choose upstream Spark if you want current features, easier upgrades, and the broadest community support.

Likely purpose

Provide a Palantir-specific Spark distribution with custom SQL/PySpark behavior, compatibility adjustments, and internal build/runtime changes for a controlled downstream deployment.

Best for

Teams already standardized on Palantir’s Spark distribution
Users who need downstream behavior stability more than upstream freshness
Organizations that need customized SQL/PySpark semantics or protocol compatibility
Operators willing to accept upgrade lag in exchange for a controlled fork

Additional features

Custom row-field reordering to match schema
Adjusted AQE repartition/coalesce behavior
Broader Hive ThriftServer compatibility via generated service bindings for multiple protocol versions
Fork-specific SQL and PySpark behavior changes around functions, pandas frame handling, and error reporting
Expanded API, schema, or SDK surface area for downstream integrations and generated clients
More operational workflow surface, such as server-side handling, auth, session control, or admin tooling

Missing features

It is materially behind upstream Spark, so it likely lacks many of the fixes and features upstream has added through 2026
Large API and documentation areas appear pruned or heavily diverged, including PySpark SQL function surfaces, pandas API files, registry metadata, and the quickstart notebook
Recent upstream SQL UI and core improvements are not reflected here, including newer metadata/loading and SQL-tab enhancements
It trails upstream by 200 commits, so some recent upstream features and fixes are likely not present yet.
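
The behind count above can be checked against a local clone. The following is a minimal sketch, assuming remotes named `upstream` (apache/spark) and `fork` (palantir/spark) have already been fetched; the remote names, the fork's branch name, and the helper names `parse_left_right_count` and `fork_divergence` are illustrative and not part of Discofork. It relies on `git rev-list --left-right --count A...B`, which prints the number of commits reachable only from `A` followed by the number reachable only from `B`.

```python
import subprocess
from typing import Tuple


def parse_left_right_count(output: str) -> Tuple[int, int]:
    """Parse the two whitespace-separated counts printed by
    `git rev-list --left-right --count A...B` into (behind, ahead)."""
    left, right = output.split()
    return int(left), int(right)


def fork_divergence(upstream_ref: str, fork_ref: str, repo_dir: str = ".") -> Tuple[int, int]:
    """Return (behind, ahead) for fork_ref relative to upstream_ref.

    behind: commits only on upstream_ref (missing from the fork)
    ahead:  commits only on fork_ref (fork-specific changes)
    """
    result = subprocess.run(
        ["git", "rev-list", "--left-right", "--count",
         f"{upstream_ref}...{fork_ref}"],
        cwd=repo_dir,
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_left_right_count(result.stdout)


if __name__ == "__main__":
    # Hypothetical refs: adjust to the remotes and branches in your clone.
    behind, ahead = fork_divergence("upstream/master", "fork/master")
    print(f"fork is {behind} commits behind and {ahead} commits ahead of upstream")
```

A count computed this way may differ from the cached figure, since both repositories can change after the analysis was cached.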

Strengths

Strong downstream customization for a specific deployment environment
Large codebase coverage across JVM, Python, SQL, and Hive ThriftServer layers
Likely better fit when you need pinned, controlled behavior instead of fast-moving upstream churn

Risks

Stale activity suggests limited maintenance velocity
Large divergence raises merge and upgrade cost
Being behind upstream increases the chance of missing security fixes, bug fixes, and new APIs
Pruned or rewritten surfaces may surprise users expecting vanilla Spark behavior