DataHub Overview
Originally developed at LinkedIn, DataHub is an open-source metadata platform built to handle metadata at large scale, with an extensible architecture for organizations that want to build custom metadata solutions. It captures technical metadata from a wide variety of sources and exposes GraphQL and REST APIs that developers can use to extend it. It suits engineering-heavy organizations that prefer to own their infrastructure and want to avoid vendor lock-in while running a high-performance data catalog.
DataHub Specifications
| Specification | Details |
| --- | --- |
| API | GraphQL, REST |
| Backend | Java (Spring Boot) |
| Platform | Kubernetes, Docker |
| Ingestion | Python SDK |
| Search Engine | Elasticsearch |
| Authentication | OIDC, LDAP, OAuth 2.0 |
| Event Streaming | Apache Kafka |
| Metadata Storage | MySQL, PostgreSQL, Elasticsearch |
| Deployment Options | Self-hosted (Open Source), Managed Cloud |
| Supported Data Platforms | 100+ integrations including Snowflake, Databricks, BigQuery, Redshift |
DataHub Pros & Cons
Pros
- Battle-tested architecture originally built and used at LinkedIn for massive-scale metadata management
- Highly extensible plugin system allowing custom metadata ingestion, storage, and retrieval solutions
- Strong community support with active development and regular releases
- Comprehensive support for diverse data sources including Kafka, Snowflake, Databricks, and major cloud providers
- Enterprise-grade search capabilities powered by Elasticsearch with GraphQL and REST APIs
- Open-source with no licensing costs, enabling full customization and self-hosting
Cons
- Steep learning curve requiring significant time investment to understand architecture and deployment
- Resource-intensive deployment requiring substantial infrastructure for production workloads
- Documentation gaps and inconsistent examples make onboarding challenging for new users
- Some advanced features and enterprise integrations require additional development effort
- UI/UX can feel complex and overwhelming for non-technical business users
DataHub FAQ
How do I install and deploy DataHub in production?
DataHub supports Docker Compose for quick setups and Kubernetes for production deployments. The recommended approach uses Helm charts on Kubernetes with separate services for ingestion, search, and metadata storage. Initial setup typically takes 2-4 hours for basic configurations.
What data sources and systems does DataHub support?
DataHub supports over 100 native integrations including Snowflake, Databricks, BigQuery, Kafka, AWS Glue, dbt, Tableau, and Looker. Custom integrations can be built with DataHub's Python ingestion framework or its Java emitter SDK.
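Ingestion runs are usually described by a recipe that pairs one source with one sink. The sketch below only assembles a recipe-shaped dictionary using the standard library, so it runs without the SDK or a live server. The helper `build_recipe`, the account and warehouse values, and the `http://localhost:8080` server address are illustrative assumptions, not values from a real deployment.

```python
import json


def build_recipe(source_type: str, source_config: dict, gms_server: str) -> dict:
    """Return a recipe-shaped dict: one source and one REST sink (hypothetical helper)."""
    if not source_type:
        raise ValueError("source_type must be non-empty")
    return {
        "source": {"type": source_type, "config": dict(source_config)},
        "sink": {"type": "datahub-rest", "config": {"server": gms_server}},
    }


# Placeholder connection values, for illustration only.
recipe = build_recipe(
    source_type="snowflake",
    source_config={"account_id": "example_account", "warehouse": "EXAMPLE_WH"},
    gms_server="http://localhost:8080",
)

print(json.dumps(recipe, indent=2))

# With the `acryl-datahub` package installed, a dict of this shape can be
# passed to `datahub.ingestion.run.pipeline.Pipeline.create(recipe)`.
# That step is omitted so the sketch has no external dependencies.
```

The same structure is what a YAML recipe file expresses, so a dictionary like this can be a convenient way to generate recipes for many sources programmatically.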
Is DataHub completely free to use?
DataHub's core platform is 100% open-source and free under the Apache 2.0 license. Acryl Data offers DataHub Cloud as a managed SaaS option with additional enterprise features and support plans.
How does DataHub handle metadata versioning and lineage?
DataHub publishes metadata changes as events on Kafka and stores versioned metadata aspects in its relational database, which together provide a change history. It supports dataset, column, and process lineage through automated ingestion and manual propagation, displaying relationships in an interactive graph UI.
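Dataset lineage can be thought of as a graph of URNs connected by upstream edges. The following is a minimal, standard-library sketch of that idea rather than DataHub's own implementation. The URN string follows DataHub's `urn:li:dataset` layout, while the helper names, the dataset names, and the in-memory edge dictionary are assumptions made for illustration.

```python
from collections import deque


def dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """Format a dataset URN in DataHub's urn:li:dataset layout."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


def all_upstreams(urn: str, upstreams: dict) -> list:
    """Breadth-first walk over direct-upstream edges, nearest ancestors first."""
    seen, order, queue = {urn}, [], deque([urn])
    while queue:
        for parent in upstreams.get(queue.popleft(), []):
            if parent not in seen:
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order


# Hypothetical three-hop pipeline: Kafka topic -> staging table -> mart table.
raw = dataset_urn("kafka", "orders_events")
staged = dataset_urn("snowflake", "analytics.staging.orders")
mart = dataset_urn("snowflake", "analytics.marts.daily_orders")

# Maps each dataset to its direct upstream datasets.
edges = {mart: [staged], staged: [raw]}

for upstream in all_upstreams(mart, edges):
    print(upstream)
```

In a real deployment the edges would come from ingested lineage metadata; here they are hard-coded so the traversal can be read in isolation.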
What programming languages and frameworks is DataHub built with?
DataHub's backend is primarily Java (Spring Boot) with a Python-based ingestion framework. The frontend uses React with Apollo GraphQL client, while Elasticsearch powers search and MySQL/PostgreSQL store core metadata.
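Because search is exposed through the GraphQL API, a client in any language can query it over HTTP. The sketch below builds, but does not send, a search request using only the standard library. The `/api/graphql` path and the `search` query shape follow DataHub's public GraphQL schema as best understood here and should be checked against the deployed version. The base URL, the token placeholder, and the helper name `build_search_request` are assumptions.

```python
import json
import urllib.request

# Query shape based on DataHub's public GraphQL schema; verify field names
# against the schema of the version you run.
SEARCH_QUERY = """
query search($input: SearchInput!) {
  search(input: $input) {
    total
    searchResults { entity { urn type } }
  }
}
"""


def build_search_request(base_url: str, text: str, token: str, count: int = 10) -> urllib.request.Request:
    """Build a POST request for a dataset search (hypothetical helper, not sent)."""
    payload = {
        "query": SEARCH_QUERY,
        "variables": {
            "input": {"type": "DATASET", "query": text, "start": 0, "count": count}
        },
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/graphql",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


# Placeholder server address and token, for illustration only.
request = build_search_request("http://localhost:8080", "orders", token="<personal-access-token>")
print(request.get_method(), request.full_url)

# To execute against a running instance:
#   with urllib.request.urlopen(request) as response:
#       results = json.load(response)
```

Keeping request construction separate from sending makes the payload easy to inspect before pointing it at a real instance.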
What is DataHub best for?
Large enterprises and data engineering teams seeking a comprehensive, open-source metadata platform to centralize data discovery, lineage, and governance across complex data ecosystems.