Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Better large output display in datafusion-cli with --maxrows option #7617

Merged
merged 2 commits into from
Sep 22, 2023

Conversation

2010YOUY01
Copy link
Contributor

Which issue does this PR close?

NA

Rationale for this change

If a query outputs a large number of lines, currently datafusion-cli will display them all and flood the terminal.
This PR added a --maxrows option to limit max number of rows for output display.

--maxrows <MAXROWS>
            The max number of rows to display for 'Table' format
            [default: 40] [possible values: numbers(0/10/...), inf(no limit)]
arrow-datafusion/datafusion-cli|cli-large-output🌝| >>> cargo run --release -- --maxrows 5
    Finished release [optimized] target(s) in 0.30s
     Running `target/release/datafusion-cli --maxrows 5`
DataFusion CLI v31.0.0
❯ CREATE EXTERNAL TABLE lineitem STORED AS PARQUET LOCATION '/Users/yongting/Desktop/code/my_datafusion/arrow-datafusion/benchmarks/data/tpch_sf10/lineitem/part-0.parquet';
0 rows in set. Query took 0.041 seconds.

❯ select l_orderkey, l_partkey from lineitem;
+------------+-----------+
| l_orderkey | l_partkey |
+------------+-----------+
| 1          | 1551894   |
| 1          | 673091    |
| 1          | 636998    |
| 1          | 21315     |
| 1          | 240267    |
| .                      |
| .                      |
| .                      |
+------------+-----------+
59986052 rows in set (5 shown). Query took 2.451 seconds.

What changes are included in this PR?

  1. Change elapsed time measurement in CLI to (query start -> all result ready in memory in [RecordBatch]), before it also counts the time for formatting and printing the results.
  2. Add --maxrows option, now only Table CLI display format is supported.

Are these changes tested?

Unit tests

Are there any user-facing changes?

No

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @2010YOUY01 -- this is great. I tried it out locally and the code looks good to me.

I think we could encapsulate the code a bit and make a MaxRows type structure like this that would make parsing a little less awkward:

struct Args { 
...
    #[clap(
        long,
        help = "The max number of rows to display for 'Table' format\n[default: 40] [possible values: numbers(0/10/...), inf(no limit)]",
        default_value = "40",
    )]
    maxrows: MaxRows,
}

With a structure like

#[derive(Debug, Clone)]
enum MaxRows {
    /// show all rows in the output
    Unlimited,
    /// Only show n rows
    Limited(usize)
}

impl FromStr for MaxRows {
    type Err = String;

    fn from_str(maxrows: &str) -> Result<Self, Self::Err> {
        if maxrows.to_lowercase() == "inf"
            || maxrows.to_lowercase() == "infinite"
            || maxrows.to_lowercase() == "none"
        {
            Ok(Self::Unlimited)
        } else {
            match maxrows.parse::<usize>() {
                Ok(nrows)  => Ok(Self::Limited(nrows)),
                _ => Err(format!("Invalid maxrows {}. Valid inputs are natural numbers or \'inf\' for no limit.", maxrows)),
            }
        }
    }
}

impl Display for MaxRows {
    fn fmt(&self, f: &mut Formatter<'_>) -> std::fmt::Result {
        match self {
            Self::Unlimited =>  write!(f, "unlimited"),
            Self::Limited(max_rows) => write!(f, "at most max_rows"),
        }
    }
}

} else {
match maxrows.parse::<usize>() {
Ok(nrows) => Ok(Some(nrows)),
_ => Err(format!("Invalid maxrows {}. Valid inputs are natural numbers or \'inf\' for no limit.", maxrows)),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
_ => Err(format!("Invalid maxrows {}. Valid inputs are natural numbers or \'inf\' for no limit.", maxrows)),
_ => Err(format!("Invalid maxrows {}. Valid inputs are natural numbers or \'none\', \'inf\', or \'infinite\' for no limit.", maxrows)),

@2010YOUY01
Copy link
Contributor Author

I think we could encapsulate the code a bit and make a MaxRows type structure like this that would make parsing a little less awkward:

Thank you for the suggestions
Besides enum MaxRows for cleaner implementation, I also let explain/analyze/describe ignore --maxrows option

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great -- thank you @2010YOUY01

@@ -211,6 +211,15 @@ async fn exec_and_print(
let statements = DFParser::parse_sql_with_dialect(&sql, dialect.as_ref())?;
for statement in statements {
let plan = ctx.state().statement_to_plan(statement).await?;

// For plans like `Explain` ignore `MaxRows` option and always display all rows
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@alamb alamb merged commit 963680d into apache:main Sep 22, 2023
22 checks passed
@2010YOUY01 2010YOUY01 deleted the cli-large-output branch September 22, 2023 21:03
Ted-Jiang pushed a commit to Ted-Jiang/arrow-datafusion that referenced this pull request Oct 7, 2023
…tion (apache#7617)

* Better large output display for CLI

* review comments
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants