Add pub method to the BatchedParquetReader to read row group directly #20620

deanm0000 · 2025-01-08T11:50:01Z

Description

I often save large Parquet files with BatchedWriter, forcing each row group to contain only a single node_id. When reading those files, I use the ParquetAsyncReader to get the row group metadata so that I know which node_ids are in the file. For reference, I do this:

    // Open the file and fetch its metadata.
    let mut async_reader = ParquetAsyncReader::from_uri(cloud_path, None, None)
        .await
        .unwrap();
    let metadata = async_reader.get_metadata().await.unwrap();
    let meta_cloned = Arc::clone(&metadata);
    let row_groups = &meta_cloned.row_groups;
    let mut nodes: Vec<u64> = vec![];
    // Each row group holds a single node_id, so its min statistic is that id.
    row_groups.into_iter().for_each(|(i, rg)| {
        let columns = rg.columns_under_root_iter("node_id").unwrap();
        columns.into_iter().for_each(|col| {
            let c = col.metadata();
            let stats = c.statistics.clone().unwrap();
            let min = stats.min_value.clone().unwrap();
            // The min value is stored as 8 little-endian bytes.
            let array: [u8; 8] = min.try_into().expect("Vec<u8> is not 8 bytes long.");
            let node = u64::from_le_bytes(array);
            nodes.push(node);
        });
    });
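
The byte-decoding step in the snippet above can be pulled out into a small helper that does not panic on unexpected input. This is only a sketch that uses the standard library alone. `decode_u64_stat` is a hypothetical name, not part of polars, and it assumes the statistic is the plain little-endian encoding of a 64-bit integer, as in the snippet.

```rust
/// Decode a statistics value that is assumed to be an 8-byte,
/// little-endian encoded 64-bit unsigned integer.
/// Returns `None` when the slice is not exactly 8 bytes long.
fn decode_u64_stat(bytes: &[u8]) -> Option<u64> {
    if bytes.len() != 8 {
        return None;
    }
    // Copy into a fixed-size array so `from_le_bytes` can consume it.
    let mut array = [0u8; 8];
    array.copy_from_slice(bytes);
    Some(u64::from_le_bytes(array))
}
```

Returning an `Option` lets the caller decide whether a missing or malformed statistic should skip the row group or abort the read.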

As it is now, I can't use (or don't see how to use) the same async_reader to read the data from each row group. Instead, I have to do a LazyFrame::scan_parquet with a filter.

The async_reader already has a RowGroupFetcher, which has fetch_row_groups, but they're not public, so they can't be used directly.

This would (I think) also alleviate the need to use bigidx, since none of my row groups are that big.
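
To illustrate how the collected node_ids could drive a direct row-group read, here is a rough sketch of the lookup I would build on my side. It uses only the standard library. `index_row_groups` is a hypothetical helper, and it assumes `mins` holds one decoded node_id per row group, in file order. The resulting indices are what I would want to pass to whatever public method ends up exposing row-group reads.

```rust
use std::collections::HashMap;

/// Build a lookup from node_id to the indices of the row groups holding it.
/// `mins` is assumed to contain one decoded node_id per row group, in file order.
fn index_row_groups(mins: &[u64]) -> HashMap<u64, Vec<usize>> {
    let mut index: HashMap<u64, Vec<usize>> = HashMap::new();
    for (i, node) in mins.iter().enumerate() {
        // A node_id may span several row groups, so keep every index.
        index.entry(*node).or_insert_with(Vec::new).push(i);
    }
    index
}
```

With this map, reading a single node_id would only require fetching the listed row groups rather than scanning the file with a filter.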

deanm0000 added the enhancement New feature or an improvement of an existing feature label Jan 8, 2025
