arrow.apache.org/blog/2022/10/25/datafusion-13.0.0

Preview meta tags from the arrow.apache.org website.

Linked Hostnames

Thumbnail

Search Engine Appearance

Google

https://arrow.apache.org/blog/2022/10/25/datafusion-13.0.0

Apache Arrow DataFusion 13.0.0 Project Update

Introduction Apache Arrow DataFusion 13.0.0 is released, and this blog contains an update on the project for the 5 months since our last update in May 2022. DataFusion is an extensible and embeddable query engine, written in Rust used to create modern, fast and efficient data pipelines, ETL processes, and database systems. You may want to check out DataFusion to extend your Rust project to: Support SQL support Support DataFrame API Support a Domain Specific Query Language Easily and quickly read and process Parquet, JSON, Avro or CSV data. Read from remote object stores such as AWS S3, Azure Blob Storage, GCP. Even though DataFusion is 4 years “young,” it has seen significant community growth in the last few months and the momentum continues to accelerate. Background DataFusion is used as the engine in many open source and commercial projects and was one of the early open source projects to provide this capability. 2022 has validated our belief in the need for such a “LLVM for database and AI systems”(alternate link) with announcements such as the release of FaceBook’s Velox engine, the major investments in Acero as well as the continued popularity of Apache Calcite and other similar technologies. While Velox and Acero focus on execution engines, DataFusion provides the entire suite of components needed to build most analytic systems, including a SQL frontend, a dataframe API, and extension points for just about everything. Some DataFusion users use a subset of the features such as the frontend (e.g. dask-sql) or the execution engine, (e.g. Blaze), and some use many different components to build both SQL based and customized DSL based systems such as InfluxDB IOx and VegaFusion. One of DataFusion’s advantages is its implementation in Rust and thus its easy integration with the broader Rust ecosystem. Rust continues to be a major source of benefit, from the ease of parallelization with the high quality and standardized async ecosystem , as well as its modern dependency management system and wonderful performance. Summary We have increased the frequency of DataFusion releases to monthly instead of quarterly. This makes it easier for the increasing number of projects that now depend on DataFusion. We have also completed the “graduation” of Ballista to its own top-level arrow-ballista repository which decouples the two projects and allows each project to move even faster. Along with numerous other bug fixes and smaller improvements, here are some of the major advances: Improved Support for Cloud Object Stores DataFusion now supports many major cloud object stores (Amazon S3, Azure Blob Storage, and Google Cloud Storage) “out of the box” via the object_store crate. Using this integration, DataFusion optimizes reading parquet files by reading only the parts of the files that are needed. Advanced SQL DataFusion now supports correlated subqueries, by rewriting them as joins. See the Subquery page in the User Guide for more information. In addition to numerous other small improvements, the following SQL features are now supported: ROWS, RANGE, PRECEDING and FOLLOWING in OVER clauses #3570 ROLLUP and CUBE grouping set expressions #2446 SUM DISTINCT aggregate support #2405 IN and NOT IN Subqueries by rewriting them to SEMI / ANTI #2421 Non equality predicates in ON clause of LEFT, RIGHT, and FULL joins #2591 Exact MEDIAN #3009 GROUPING SETS/CUBE/ROLLUP #2716 More DDL Support Just as it is important to query, it is also important to give users the ability to define their data sources. We have added: CREATE VIEW #2279 DESCRIBE <table> #2642 Custom / Dynamic table provider factories #3311 SHOW CREATE TABLE for support for views #2830 Faster Execution Performance is always an important goal for DataFusion, and there are a number of significant new optimizations such as Optimizations of TopK (queries with a LIMIT or OFFSET clause): #3527, #2521 Reduce left/right/full joins to inner join #2750 Convert cross joins to inner joins when possible #3482 Sort preserving SortMergeJoin #2699 Improvements in group by and sort performance #2375 Adaptive regex_replace implementation #3518 Optimizer Enhancements Internally the optimizer has been significantly enhanced as well. Casting / coercion now happens during logical planning #3185 #3636 More sophisticated expression analysis and simplification is available Parquet The parquet reader can now read directly from parquet files on remote object storage #2489 #3051 Experimental support for “predicate pushdown” with late materialization after filtering during the scan (another blog post on this topic is coming soon). Support reading directly from AWS S3 and other object stores via datafusion-cli #3631 DataType Support Support for TimestampTz #3660 Expanded support for the Decimal type, including IN list and better built in coercion. Expanded support for date/time manipulation such as date_bin built-in function , timestamp +/- interval, TIME literal values #3010, #3110, #3034 Binary operations (AND, XOR, etc): #3037 #3420 IS TRUE/FALSE and IS [NOT] UNKNOWN #3235, #3246 Upcoming Work With the community growing and code accelerating, there is so much great stuff on the horizon. Some features we expect to land in the next few months: Complete Parquet Pushdown Additional date/time support Cost models, Nested Join Optimizations, analysis framework #128, #3843, #3845 Community Growth The DataFusion 9.0.0 and 13.0.0 releases consists of 433 PRs from 64 distinct contributors. This does not count all the work that goes into our dependencies such as arrow, parquet, and object_store, that much of the same community helps nurture. How to Get Involved Kudos to everyone in the community who contributed ideas, discussions, bug reports, documentation and code. It is exciting to be building something so cool together! If you are interested in contributing to DataFusion, we would love to have you join us on our journey to create the most advanced open source query engine. You can try out DataFusion on some of your own data and projects and let us know how it goes or contribute a PR with documentation, tests or code. A list of open issues suitable for beginners is here. Check out our Communication Doc on more ways to engage with the community. Appendix: Contributor Shoutout To give a sense of the number of people who contribute to this project regularly, we present for your consideration the following list derived from git shortlog -sn 9.0.0..13.0.0 . Thank you all again! 87 Andy Grove 71 Andrew Lamb 29 Kun Liu 29 Kirk Mitchener 17 Wei-Ting Kuo 14 Yang Jiang 12 Raphael Taylor-Davies 11 Batuhan Taskaya 10 Brent Gardner 10 Remzi Yang 10 comphead 10 xudong.w 8 AssHero 7 Ruihang Xia 6 Dan Harris 6 Daniël Heres 6 Ian Alexander Joiner 6 Mike Roberts 6 askoa 4 BaymaxHWY 4 gorkem 4 jakevin 3 George Andronchik 3 Sarah Yurick 3 Stuart Carnie 2 Dalton Modlin 2 Dmitry Patsura 2 JasonLi 2 Jon Mease 2 Marco Neumann 2 yahoNanJing 1 Adilet Sarsembayev 1 Ayush Dattagupta 1 Dezhi Wu 1 Dhamotharan Sritharan 1 Eduard Karacharov 1 Francis Du 1 Harbour Zheng 1 Ismaël Mejía 1 Jack Klamer 1 Jeremy Dyer 1 Jiayu Liu 1 Kamil Konior 1 Liang-Chi Hsieh 1 Martin Grigorov 1 Matthijs Brobbel 1 Mehmet Ozan Kabak 1 Metehan Yıldırım 1 Morgan Cassels 1 Nitish Tiwari 1 Renjie Liu 1 Rito Takeuchi 1 Robert Pack 1 Thomas Cameron 1 Vrishabh 1 Xin Hao 1 Yijie Shen 1 byteink 1 kamille 1 mateuszkj 1 nvartolomei 1 yourenawo 1 Özgür Akkurt

Bing

Apache Arrow DataFusion 13.0.0 Project Update

https://arrow.apache.org/blog/2022/10/25/datafusion-13.0.0

DuckDuckGo

https://arrow.apache.org/blog/2022/10/25/datafusion-13.0.0

Apache Arrow DataFusion 13.0.0 Project Update

General Meta Tags
10
- title
  Apache Arrow DataFusion 13.0.0 Project Update | Apache Arrow
- charset
  UTF-8
- X-UA-Compatible
  IE=edge
- viewport
  width=device-width, initial-scale=1
- generator
  Jekyll v4.3.3
Open Graph Meta Tags
7
- og:title
  Apache Arrow DataFusion 13.0.0 Project Update
- og:locale
  en_US
- og:description
  Introduction Apache Arrow DataFusion 13.0.0 is released, and this blog contains an update on the project for the 5 months since our last update in May 2022. DataFusion is an extensible and embeddable query engine, written in Rust used to create modern, fast and efficient data pipelines, ETL processes, and database systems. You may want to check out DataFusion to extend your Rust project to: Support SQL support Support DataFrame API Support a Domain Specific Query Language Easily and quickly read and process Parquet, JSON, Avro or CSV data. Read from remote object stores such as AWS S3, Azure Blob Storage, GCP. Even though DataFusion is 4 years “young,” it has seen significant community growth in the last few months and the momentum continues to accelerate. Background DataFusion is used as the engine in many open source and commercial projects and was one of the early open source projects to provide this capability. 2022 has validated our belief in the need for such a “LLVM for database and AI systems”(alternate link) with announcements such as the release of FaceBook’s Velox engine, the major investments in Acero as well as the continued popularity of Apache Calcite and other similar technologies. While Velox and Acero focus on execution engines, DataFusion provides the entire suite of components needed to build most analytic systems, including a SQL frontend, a dataframe API, and extension points for just about everything. Some DataFusion users use a subset of the features such as the frontend (e.g. dask-sql) or the execution engine, (e.g. Blaze), and some use many different components to build both SQL based and customized DSL based systems such as InfluxDB IOx and VegaFusion. One of DataFusion’s advantages is its implementation in Rust and thus its easy integration with the broader Rust ecosystem. Rust continues to be a major source of benefit, from the ease of parallelization with the high quality and standardized async ecosystem , as well as its modern dependency management system and wonderful performance. Summary We have increased the frequency of DataFusion releases to monthly instead of quarterly. This makes it easier for the increasing number of projects that now depend on DataFusion. We have also completed the “graduation” of Ballista to its own top-level arrow-ballista repository which decouples the two projects and allows each project to move even faster. Along with numerous other bug fixes and smaller improvements, here are some of the major advances: Improved Support for Cloud Object Stores DataFusion now supports many major cloud object stores (Amazon S3, Azure Blob Storage, and Google Cloud Storage) “out of the box” via the object_store crate. Using this integration, DataFusion optimizes reading parquet files by reading only the parts of the files that are needed. Advanced SQL DataFusion now supports correlated subqueries, by rewriting them as joins. See the Subquery page in the User Guide for more information. In addition to numerous other small improvements, the following SQL features are now supported: ROWS, RANGE, PRECEDING and FOLLOWING in OVER clauses #3570 ROLLUP and CUBE grouping set expressions #2446 SUM DISTINCT aggregate support #2405 IN and NOT IN Subqueries by rewriting them to SEMI / ANTI #2421 Non equality predicates in ON clause of LEFT, RIGHT, and FULL joins #2591 Exact MEDIAN #3009 GROUPING SETS/CUBE/ROLLUP #2716 More DDL Support Just as it is important to query, it is also important to give users the ability to define their data sources. We have added: CREATE VIEW #2279 DESCRIBE <table> #2642 Custom / Dynamic table provider factories #3311 SHOW CREATE TABLE for support for views #2830 Faster Execution Performance is always an important goal for DataFusion, and there are a number of significant new optimizations such as Optimizations of TopK (queries with a LIMIT or OFFSET clause): #3527, #2521 Reduce left/right/full joins to inner join #2750 Convert cross joins to inner joins when possible #3482 Sort preserving SortMergeJoin #2699 Improvements in group by and sort performance #2375 Adaptive regex_replace implementation #3518 Optimizer Enhancements Internally the optimizer has been significantly enhanced as well. Casting / coercion now happens during logical planning #3185 #3636 More sophisticated expression analysis and simplification is available Parquet The parquet reader can now read directly from parquet files on remote object storage #2489 #3051 Experimental support for “predicate pushdown” with late materialization after filtering during the scan (another blog post on this topic is coming soon). Support reading directly from AWS S3 and other object stores via datafusion-cli #3631 DataType Support Support for TimestampTz #3660 Expanded support for the Decimal type, including IN list and better built in coercion. Expanded support for date/time manipulation such as date_bin built-in function , timestamp +/- interval, TIME literal values #3010, #3110, #3034 Binary operations (AND, XOR, etc): #3037 #3420 IS TRUE/FALSE and IS [NOT] UNKNOWN #3235, #3246 Upcoming Work With the community growing and code accelerating, there is so much great stuff on the horizon. Some features we expect to land in the next few months: Complete Parquet Pushdown Additional date/time support Cost models, Nested Join Optimizations, analysis framework #128, #3843, #3845 Community Growth The DataFusion 9.0.0 and 13.0.0 releases consists of 433 PRs from 64 distinct contributors. This does not count all the work that goes into our dependencies such as arrow, parquet, and object_store, that much of the same community helps nurture. How to Get Involved Kudos to everyone in the community who contributed ideas, discussions, bug reports, documentation and code. It is exciting to be building something so cool together! If you are interested in contributing to DataFusion, we would love to have you join us on our journey to create the most advanced open source query engine. You can try out DataFusion on some of your own data and projects and let us know how it goes or contribute a PR with documentation, tests or code. A list of open issues suitable for beginners is here. Check out our Communication Doc on more ways to engage with the community. Appendix: Contributor Shoutout To give a sense of the number of people who contribute to this project regularly, we present for your consideration the following list derived from git shortlog -sn 9.0.0..13.0.0 . Thank you all again! 87 Andy Grove 71 Andrew Lamb 29 Kun Liu 29 Kirk Mitchener 17 Wei-Ting Kuo 14 Yang Jiang 12 Raphael Taylor-Davies 11 Batuhan Taskaya 10 Brent Gardner 10 Remzi Yang 10 comphead 10 xudong.w 8 AssHero 7 Ruihang Xia 6 Dan Harris 6 Daniël Heres 6 Ian Alexander Joiner 6 Mike Roberts 6 askoa 4 BaymaxHWY 4 gorkem 4 jakevin 3 George Andronchik 3 Sarah Yurick 3 Stuart Carnie 2 Dalton Modlin 2 Dmitry Patsura 2 JasonLi 2 Jon Mease 2 Marco Neumann 2 yahoNanJing 1 Adilet Sarsembayev 1 Ayush Dattagupta 1 Dezhi Wu 1 Dhamotharan Sritharan 1 Eduard Karacharov 1 Francis Du 1 Harbour Zheng 1 Ismaël Mejía 1 Jack Klamer 1 Jeremy Dyer 1 Jiayu Liu 1 Kamil Konior 1 Liang-Chi Hsieh 1 Martin Grigorov 1 Matthijs Brobbel 1 Mehmet Ozan Kabak 1 Metehan Yıldırım 1 Morgan Cassels 1 Nitish Tiwari 1 Renjie Liu 1 Rito Takeuchi 1 Robert Pack 1 Thomas Cameron 1 Vrishabh 1 Xin Hao 1 Yijie Shen 1 byteink 1 kamille 1 mateuszkj 1 nvartolomei 1 yourenawo 1 Özgür Akkurt
- og:url
  https://arrow.apache.org/blog/2022/10/25/datafusion-13.0.0/
- og:site_name
  Apache Arrow
Twitter Meta Tags
1
- twitter:card
  summary_large_image
Link Tags
16
- alternate
  https://arrow.apache.org/feed.xml
- apple-touch-icon
  /img/apple-touch-icon.png
- apple-touch-icon
  /img/apple-touch-icon-120x120.png
- apple-touch-icon
  /img/apple-touch-icon-76x76.png
- apple-touch-icon
  /img/apple-touch-icon-60x60.png

Links

115

arrow.apache.org/blog/2022/10/25/datafusion-13.0.0

Linked Hostnames

Thumbnail

Search Engine Appearance

Google

Apache Arrow DataFusion 13.0.0 Project Update

Bing

Apache Arrow DataFusion 13.0.0 Project Update

DuckDuckGo

Apache Arrow DataFusion 13.0.0 Project Update

General Meta Tags

Open Graph Meta Tags

Twitter Meta Tags

Link Tags

Links