DISCUSSION: Data for pandas examples #150
Comments
@datapythonista I really like this idea! As a pandas user, I completely agree that it's difficult to understand the documentation of some functions when the example is based on random numbers. One question: are you thinking of using examples already uploaded somewhere on the internet, or creating the examples ourselves? (Or are we open to both ideas?)
I'm open to any idea. But my personal opinion is that it will make things easier for users if we use datasets that are as small as possible. Showing
I am thinking of a central theme and multiple datasets around it. That will make the documentation read like a story for different use cases. For example, a students dataset, an HR system, or a sales system. I have a pandas article in a similar fashion for your reference.
Based on @bhavaniravi's idea of keeping the same theme across the three different types of dataframes that @datapythonista mentioned, I came up with the following examples (ignore the actual numbers in the dataframes, they won't make sense). We could create a multi-index dataframe using as an example a company with multiple branches in different cities that wants to know how each of its sales departments performed on 4 profit variables (units sold, total revenue, cost of produced goods and operating costs), measured yearly since 2013:
We could obtain a simple version of the dataframe by keeping one of these branches and its profit variables during a specific year, e.g.:
And finally, the timeseries data could be obtained by keeping only one of these variables across the years (in this example it would be units sold across the years):
Of course it doesn't have to be this particular example :) I guess the main point is that from the most complex dataframe (the one with the multi-index) you can derive the two others (the simple one and the time-series one).
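To make the "derive the simpler ones from the complex one" idea concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the branch and department names, the column names, and the placeholder numbers (which, as said above, are not meant to make sense).

```python
import numpy as np
import pandas as pd

# Hypothetical labels, chosen only for illustration.
branches = ["Buenos Aires", "Madrid"]
departments = ["online", "retail"]
years = [2013, 2014, 2015]
variables = ["units_sold", "revenue", "cogs", "operating_costs"]

# Multi-index dataframe: one row per (branch, department, year).
index = pd.MultiIndex.from_product(
    [branches, departments, years], names=["branch", "department", "year"]
)
# Placeholder numbers (0, 1, 2, ...), not realistic sales figures.
values = np.arange(len(index) * len(variables)).reshape(len(index), len(variables))
sales = pd.DataFrame(values, index=index, columns=variables)

# Simple dataframe: one branch in one year, departments as rows.
simple = sales.xs(("Madrid", 2014), level=["branch", "year"])

# Time series: one variable for one department, indexed by year.
units_sold = sales.xs(("Madrid", "retail"), level=["branch", "department"])["units_sold"]
```

Here `simple` has one row per department and one column per profit variable, and `units_sold` is a Series indexed by `year`.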
I agree, @martinagvilas. I think it's also helpful to have the same dataset represent a complex dataframe, a simpler one, and other variations. I imagine a lot of people would be interested in transforming one into the other as well.
Very often in the pandas documentation, simple DataFrame objects are created to show examples. Many of them just use random data; see for example https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#id1
Then, if I want to show an operation, I can get something like:
And in my opinion the example is quite useless (other than for the syntax), because if you don't know what the operation does, the example does not help you understand it.
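As a rough illustration of the problem, here is a sketch of a random-data example. The column names, the seed, and the choice of `cumsum` are my own assumptions, not taken from any particular docs page.

```python
import numpy as np
import pandas as pd

# Random data, as in many existing examples: the values carry no meaning.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((4, 3)), columns=["A", "B", "C"])

# The call is syntactically clear, but a reader cannot tell from the
# output alone what the operation did to the numbers.
result = df.cumsum()
print(result)
```

The printed table is just another block of decimals, so it shows how to call the method but not what it computes.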
The best example I could find to overcome that (probably not great, but the best I could find) is:
Then, when performing an operation, it is easy to guess what it's doing, or to double-check if you already have a guess:
We are already using some of those in some examples: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename_axis.html
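A minimal sketch of what such a readable example can look like, assuming a tiny animals table in the spirit of the linked page (the exact rows and values here are illustrative assumptions):

```python
import pandas as pd

# Small dataset where every value can be checked by common sense.
animals = pd.DataFrame(
    {"num_legs": [4, 4, 2], "num_arms": [0, 0, 2]},
    index=["dog", "cat", "monkey"],
)

# Column-wise sum: total legs and total arms across the three animals.
totals = animals.sum()

# Boolean filter: only animals that have arms.
with_arms = animals[animals["num_arms"] > 0]
```

With data like this, a reader can verify the result in their head: 4 + 4 + 2 legs, 0 + 0 + 2 arms, and only the monkey has arms.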
While this worked well in some places, we found this dataset insufficient to show all pandas functionality. And while we initially wanted to standardize the data used in the examples, so that things are easier for recurring users, we eventually forgot about it.
But while it's surely not simple, I think it'd be ideal if we could find a very small number of datasets that can be used in all pandas examples. The ones I think we surely need are:
If we're able to find the ones we need, I think it'd also be great if we could have something like:
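One possible shape for this, purely as a sketch: a small loader that returns a named example dataset. The function `load_example` and the dataset name `"animals"` are hypothetical and defined here for illustration; pandas does not provide them.

```python
import pandas as pd

# Hypothetical registry of example datasets (not part of pandas).
_EXAMPLES = {
    "animals": pd.DataFrame(
        {"num_legs": [4, 4, 2], "num_arms": [0, 0, 2]},
        index=["dog", "cat", "monkey"],
    ),
}


def load_example(name: str) -> pd.DataFrame:
    """Return a copy of the named example dataset.

    Hypothetical helper: raises KeyError for unknown names and returns
    a copy so that one example cannot modify the data seen by another.
    """
    return _EXAMPLES[name].copy()


# A docstring example could then start from a single line of setup.
df = load_example("animals")
```

With a helper like this, each example could open with one line and go straight to the operation being documented.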
That should make the examples much simpler, and directly show the point they are trying to show. See for example the MultiIndex example here, how creating the DataFrame distracts from the operation shown: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html
@python-sprints/pandas-mentoring thoughts? Ideas on datasets?