Data parsing
User data is the entry point of the RAWGraphs workflow, and the rawgraphs-core library introduces some concepts and utilities to deal with user-provided datasets.
RAWGraphs works on a tabular dataset, which is transformed internally according to the chart type we want to draw.
Importing this kind of dataset from common formats like CSV into JavaScript is already solved by other libraries (we use d3-dsv in the RAWGraphs app). However, because these are text-based formats, we must also define the data types of the different columns.
For example, in this CSV dataset:
all data points and values are formally strings, but the column `year` could be parsed as a date, and the columns `orders` and `total` are numbers.
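As an illustration, here is a hypothetical CSV with those three columns. The values are invented, and the naive `parseSimpleCsv` helper is only a stand-in for a real parser such as d3-dsv. The sketch shows that every value is still a string after the text has been split into rows.

```javascript
// Hypothetical CSV text; the values are invented for illustration.
const csvText = [
  "year,orders,total",
  "2019,120,1540.50",
  "2020,95,1210.00",
  "2021,143,1985.75",
].join("\n");

// Naive CSV split: fine for this sample, but it ignores quoting and escaping,
// which a real parser handles.
function parseSimpleCsv(text) {
  const [headerLine, ...lines] = text.trim().split("\n");
  const columns = headerLine.split(",");
  return lines.map((line) => {
    const cells = line.split(",");
    return Object.fromEntries(columns.map((name, i) => [name, cells[i]]));
  });
}

const rows = parseSimpleCsv(csvText);

// Every value is still a string at this point, including years and totals.
const allStrings = rows.every((row) =>
  Object.values(row).every((value) => typeof value === "string")
);

console.log(rows[0], allStrings);
```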
Data Types in RAWGraphs
RAWGraphs has the concept of data type and can handle strings, numbers and dates.
When handling a user dataset, it is required that each data column in the tabular dataset has the same data type for all the data points.
When you create an instance of a RAWGraphs chart, you can declare the data types of the columns in the dataset; otherwise the library infers them from the dataset. Given a set of columns, their data types are expressed as an object whose keys are the column names and whose values are one of "number", "date", or "string".
For example:
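A small illustration of such an object follows. The column names are hypothetical, and `invalidColumns` is a sanity-check helper written for this sketch, not part of rawgraphs-core.

```javascript
// Hypothetical column names; the three allowed values come from the text above.
const dataTypes = {
  year: "date",
  orders: "number",
  total: "number",
  country: "string",
};

const ALLOWED_TYPES = ["number", "date", "string"];

// Illustrative helper: list the columns whose declared type is not allowed.
function invalidColumns(types) {
  return Object.entries(types)
    .filter(([, type]) => !ALLOWED_TYPES.includes(type))
    .map(([column]) => column);
}

console.log(invalidColumns(dataTypes)); // []
```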
Let's go back to the basic example of rendering a chart:
As you can see, there is no mention of data types, but under the hood rawgraphs was able to identify the types of the "age" and "height" columns in the dataset. This matters here because the bubblechart only accepts numbers or dates on the `x` dimension and numbers on the `y` dimension.
We could have been more verbose, by specifying the datatypes:
In this way you can "force" the visualization to use your explicit data types.
"Real-world" data and data types inference
When dealing with real-world datasets, we often start from spreadsheets, text files, database exports, or copied and pasted content. These sources carry no information about data types and are often "formatted" with conventions based on the language and culture of whoever produced the data.
An obvious case is date formats, whose standards change from country to country. Dates can be expressed with a mixture of words and numbers (e.g. Jan 2021), and can specify date and time (e.g. 2021-01-01 18:00:00), the date only (e.g. 2021-01-01), or just a part of it (e.g. 2021).
Another case is number formatting, where either the dot `.` or the comma `,` is used as the decimal separator, depending on the language.
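To make the ambiguity concrete, here is a minimal sketch of a separator-aware number parser. The helper `parseLocalizedNumber` and its `decimal` and `group` parameters are assumptions made for this illustration; this is not how rawgraphs-core implements number parsing.

```javascript
// Hypothetical helper: parse a number given explicit decimal and grouping separators.
// Callers should pass both separators together so that they never coincide.
function parseLocalizedNumber(text, { decimal = ".", group = "," } = {}) {
  let normalized = text.trim();
  if (group) {
    // Drop grouping separators, e.g. "1.234,5" -> "1234,5" when group is ".".
    normalized = normalized.split(group).join("");
  }
  if (decimal !== ".") {
    // Rewrite the decimal separator to the dot that Number() expects.
    normalized = normalized.split(decimal).join(".");
  }
  // Accept only plain decimal notation after normalization.
  if (!/^[+-]?(\d+\.?\d*|\.\d+)$/.test(normalized)) {
    return undefined;
  }
  return Number(normalized);
}

// The same text yields different numbers under different conventions.
const commaAsDecimal = parseLocalizedNumber("3,14", { decimal: ",", group: "." });
const commaAsGroup = parseLocalizedNumber("3,14", { decimal: ".", group: "," });
console.log(commaAsDecimal, commaAsGroup);
```

Without knowing which convention produced the text, neither reading can be called wrong, which is why the separators have to be supplied or guessed.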
RAWGraphs includes two functions, also used in the RAWGraphs app, to address this problem: `inferTypes` and `parseDataset`.
These utilities are used in the "Data loading" section of the app.
inferTypes
This function can be used to detect the data types of a dataset. Its signature is the following:
inferTypes(data, parsingOptions) ⇒ DataTypes
- The `data` parameter is the array of objects that must be parsed
- The `parsingOptions` parameter is an optional object with the following properties:
Name | Type | Description |
---|---|---|
strict | boolean | If strict is false, JSON parsing of the values is attempted (for example, with strict=false, "true" becomes true). |
locale | string | |
decimal | string | |
group | string | |
numerals | Array.<string> | |
dateLocale | string | |
The return value of the function is an object representing the guessed data types. Its shape is described in the API docs and extends the data types definition of RAWGraphs by allowing custom properties to be specified for the different data types.
For example, the `date` type allows a `dateFormat` property to be specified.
Example of datatypes
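A minimal sketch of such an extended definition follows. It assumes that the object form carries the base type under a `type` key next to `dateFormat`; the column names, the format string, and the `baseTypeOf` helper are hypothetical and only serve this illustration.

```javascript
// Assumed shape: a column is described either by a type name
// or by an object with a `type` key plus custom properties.
const dataTypes = {
  orders: "number",
  created: { type: "date", dateFormat: "DD/MM/YYYY" },
};

// Illustrative helper: reduce both forms to the base type name.
function baseTypeOf(columnType) {
  return typeof columnType === "string" ? columnType : columnType.type;
}

const summary = Object.fromEntries(
  Object.entries(dataTypes).map(([column, type]) => [column, baseTypeOf(type)])
);

console.log(summary);
```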
info
The `dateFormat` property in each data type definition plays a role similar to the options available in the `parsingOptions` parameter. The only difference is that the date format may be specified for each column, while the numeric formatting is global. This reflects the actual user interface of the RAWGraphs app.
Let's see how the function is used:
inferTypes example
As you can see, even though the `b` column of our dataset is formally a string, the library was able to cast it to numbers.
info
For each column, RAWGraphs tries to cast each datum to each data type. If the majority of data points can be cast to a type, that type is chosen for the column.
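The majority rule described above can be sketched as follows. This is a simplified stand-in, not the library's code: the `casters` table, the order in which types are tried, the strict "more than half" threshold, and the function names are all assumptions.

```javascript
// Candidate casts; each returns undefined when the value does not fit the type.
const casters = {
  number: (value) => {
    const text = String(value).trim();
    if (text === "") return undefined;
    const n = Number(text);
    return Number.isFinite(n) ? n : undefined;
  },
  date: (value) => {
    const text = String(value).trim();
    if (!/^\d{4}-\d{2}-\d{2}$/.test(text)) return undefined;
    const d = new Date(text);
    return Number.isNaN(d.getTime()) ? undefined : d;
  },
};

// Pick the first type that more than half of the values can be cast to,
// falling back to "string" when no type reaches a majority.
function inferColumnType(values) {
  for (const [type, cast] of Object.entries(casters)) {
    const hits = values.filter((v) => cast(v) !== undefined).length;
    if (hits > values.length / 2) return type;
  }
  return "string";
}

function inferTypesSketch(rows) {
  const columns = Object.keys(rows[0] ?? {});
  return Object.fromEntries(
    columns.map((c) => [c, inferColumnType(rows.map((r) => r[c]))])
  );
}

// Hypothetical rows: column b holds numeric strings plus one outlier.
const rows = [
  { a: "x", b: "1.5", c: "2021-01-01" },
  { a: "y", b: "2", c: "2021-02-01" },
  { a: "z", b: "n/a", c: "2021-03-01" },
];

console.log(inferTypesSketch(rows));
```

In this sketch column `b` is still reported as a number because two of its three values can be cast, which mirrors the majority behaviour the note describes.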
Let's see a couple more examples. The function is able to detect ISO date formats:
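As a rough idea of what recognising ISO-style values involves, the sketch below accepts `YYYY-MM-DD` and `YYYY-MM-DD HH:mm:ss` (or with a `T` separator) and rejects everything else. The `parseIsoLike` helper is hypothetical, and interpreting the values as UTC is an assumption of this sketch; the library's own detection may differ.

```javascript
const ISO_DATE = /^(\d{4})-(\d{2})-(\d{2})$/;
const ISO_DATETIME = /^(\d{4})-(\d{2})-(\d{2})[T ](\d{2}):(\d{2}):(\d{2})$/;

// Hypothetical helper: return a Date for ISO-like text, or undefined otherwise.
function parseIsoLike(text) {
  const match = ISO_DATE.exec(text) || ISO_DATETIME.exec(text);
  if (!match) return undefined;
  const [year, month, day, hour = 0, minute = 0, second = 0] = match
    .slice(1)
    .map(Number);
  // Assumption: values are interpreted as UTC.
  const date = new Date(Date.UTC(year, month - 1, day, hour, minute, second));
  // Reject overflowed values such as 2021-02-30, which Date.UTC rolls into March.
  const valid =
    date.getUTCFullYear() === year &&
    date.getUTCMonth() === month - 1 &&
    date.getUTCDate() === day &&
    date.getUTCHours() === hour &&
    date.getUTCMinutes() === minute &&
    date.getUTCSeconds() === second;
  return valid ? date : undefined;
}

console.log(parseIsoLike("2021-01-01"));
console.log(parseIsoLike("2021-01-01 18:00:00"));
console.log(parseIsoLike("Jan 2021")); // undefined
```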
We can also use the `parsingOptions` parameter to specify that the comma is the decimal separator:
info
The `inferTypes` function limits its search for data types to numbers with `.` as the decimal separator and to dates and datetimes in ISO format. It does not explore all the possible date formats and formatting options.
parseDataset
This function can be used to parse a "formatted" dataset, already split into rows and columns, and has the following syntax:
parseDataset(data, [types], [parsingOptions]) ⇒ ParserResult
- The `data` parameter is the array of objects that must be parsed
- The `types` parameter is an optional object specifying the types we want to enforce, with the same syntax as the dataTypes output described for the `inferTypes` function
- The `parsingOptions` parameter (optional) is the same object described for the `inferTypes` function.
Note that if we don't pass the `types` parameter, the library will try to detect types automatically using the `inferTypes` function described above. If no data type can be detected for a column, the library treats it as a simple string.
The function returns an object with three properties:
- `dataset`: the set of rows that could be parsed
- `dataTypes`: the data types used for parsing
- `errors`: errors that prevented rows from being parsed according to `dataTypes`
Examples
Let's try to parse a dataset in which some values change type from row to row. We won't specify any "hint" for data types, nor any parsing options.
Here's an example of parsing dates with some parsing errors:
In this case the library was able to detect the `b` column as an ISO date, but could only parse 2 of the 3 records in the dataset correctly. The array of errors contains information about the row and column that could not be parsed.
Values that cannot be parsed with the specified type become `undefined` in the parsed dataset.
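The behaviour described in this section can be mimicked with a small self-contained sketch. It is not the library's implementation: `parseRowsSketch`, the `parsers` table, and the shape of each error entry (`row`, `column`, `value`) are assumptions chosen for illustration. Only the three result properties and the `undefined` fallback come from the text above.

```javascript
// Illustrative parsers; each returns undefined when the value does not fit the type.
const parsers = {
  string: (v) => String(v),
  number: (v) => {
    const text = String(v).trim();
    const n = text === "" ? NaN : Number(text);
    return Number.isFinite(n) ? n : undefined;
  },
  date: (v) => {
    if (!/^\d{4}-\d{2}-\d{2}$/.test(String(v))) return undefined;
    const d = new Date(v);
    return Number.isNaN(d.getTime()) ? undefined : d;
  },
};

// Parse every row with the given types, collecting the cells that fail.
function parseRowsSketch(rows, dataTypes) {
  const errors = [];
  const dataset = rows.map((row, rowIndex) => {
    const parsed = {};
    for (const [column, type] of Object.entries(dataTypes)) {
      const value = parsers[type](row[column]);
      if (value === undefined) {
        // Hypothetical error shape: which row and column failed, and the raw value.
        errors.push({ row: rowIndex, column, value: row[column] });
      }
      parsed[column] = value; // unparseable values stay undefined
    }
    return parsed;
  });
  return { dataset, dataTypes, errors };
}

const result = parseRowsSketch(
  [
    { a: "1", b: "2021-01-01" },
    { a: "2", b: "2021-02-01" },
    { a: "3", b: "not a date" },
  ],
  { a: "number", b: "date" }
);

console.log(result.errors);
```

Here the third row keeps its numeric `a` value while `b` is `undefined`, and the single entry in `errors` points at row index 2, column `b`.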