Title: | X Ray Vision on your Datasets |
---|---|
Description: | Tools to analyze datasets prior to any statistical modeling. Includes various functions designed to find inconsistencies and understand the distribution of the data. |
Authors: | Pablo Seibelt [aut, cre] |
Maintainer: | Pablo Seibelt <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.2.900 |
Built: | 2024-11-05 02:56:14 UTC |
Source: | https://github.com/sicarul/xray |
If any anomalous columns are found, they are reported as a warning and returned in a data.frame. The following anomalies are checked, each shown with the label used for it in the output:
NA values: NA
0 values: Zero
Blank strings: Blank
Infinite numbers: Inf
anomalies(data_analyze, anomaly_threshold = 0.8, distinct_threshold = 2)
data_analyze | a data frame or tibble to analyze |
anomaly_threshold | the minimum percentage of anomalous rows for the column to be problematic, expressed as a fraction (the default 0.8 means 80%) |
distinct_threshold | the minimum number of distinct values a column must have to not be considered problematic; you usually want to keep this at its default value |
All of these values are reported in columns prefixed by q (quantity), indicating the number of rows with the anomaly, and p (percentage), indicating the percentage of total rows with the anomaly.
Columns with only one distinct value are also flagged, since such a column adds no information to the table (if all rows are equal, why bother having that column?). The number of distinct values is reported in qDistinct.
library(xray)
anomalies(mtcars, anomaly_threshold=0.5)
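To make the q and p columns concrete, the following base R sketch computes the same kinds of counts by hand. It illustrates the definitions above and is not the package's implementation: the helper name `count_anomalies`, the `anomalous` column, and the rule of summing the four anomaly proportions before comparing them to `anomaly_threshold` are all assumptions made for this example.

```r
# Hypothetical helper (not part of xray): counts anomalies per column
# using base R only, following the q/p naming described above.
count_anomalies <- function(df, anomaly_threshold = 0.8, distinct_threshold = 2) {
  n <- nrow(df)
  rows <- lapply(names(df), function(col) {
    x <- df[[col]]
    q_na    <- sum(is.na(x))
    q_zero  <- if (is.numeric(x)) sum(x == 0, na.rm = TRUE) else 0L
    q_blank <- if (is.character(x)) sum(x == "", na.rm = TRUE) else 0L
    q_inf   <- if (is.numeric(x)) sum(is.infinite(x)) else 0L
    data.frame(
      Variable  = col,
      qNA       = q_na,    pNA    = q_na / n,
      qZero     = q_zero,  pZero  = q_zero / n,
      qBlank    = q_blank, pBlank = q_blank / n,
      qInf      = q_inf,   pInf   = q_inf / n,
      qDistinct = length(unique(x)),
      stringsAsFactors = FALSE
    )
  })
  out <- do.call(rbind, rows)
  # Assumed rule: a column is flagged when its combined anomaly share reaches
  # the threshold, or when it has fewer distinct values than required.
  total_share <- out$pNA + out$pZero + out$pBlank + out$pInf
  out$anomalous <- total_share >= anomaly_threshold |
    out$qDistinct < distinct_threshold
  out
}

res <- count_anomalies(mtcars, anomaly_threshold = 0.5)
res[res$anomalous, c("Variable", "qZero", "pZero", "qDistinct")]
```

With a threshold of 0.5, the `vs` and `am` columns of `mtcars` are flagged in this sketch, because more than half of their values are 0.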
Also returns a table of all numeric variables describing their percentiles 1, 10, 25, 50 (median), 75, 90 and 99.
distributions(data_analyze, outdir, charts = T)
data_analyze | a data frame to analyze |
outdir | an optional output directory to save the resulting plots as png images |
charts | set this to FALSE to avoid generating charts, useful for batch script usage |
library(xray)
distributions(mtcars)
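For reference, a percentile table of the kind described above can be approximated with base R's `quantile()`. This is a minimal sketch under stated assumptions: the object names are invented for the example, the column labels are the defaults produced by `quantile()`, and the package may use different labels or a different quantile method.

```r
# The percentiles listed above, expressed as probabilities for quantile().
probs <- c(0.01, 0.10, 0.25, 0.50, 0.75, 0.90, 0.99)

# Keep only the numeric columns (mtcars happens to be entirely numeric).
numeric_cols <- mtcars[vapply(mtcars, is.numeric, logical(1))]

# vapply() returns one column per variable, so transpose to get
# one row per variable and one column per percentile.
pct_table <- t(vapply(
  numeric_cols,
  quantile,
  numeric(length(probs)),
  probs = probs,
  na.rm = TRUE
))

pct_table
```

The `50%` column of this table matches `median()` for each variable, which is a quick sanity check on the result.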
Analyze each variable with respect to a time variable.
timebased(data_analyze, date_variable, time_unit = "auto", nvals_num_to_cat = 2, outdir)
data_analyze | a data frame to analyze |
date_variable | the variable (length one character vector or bare expression) that will be used to pivot all other variables |
time_unit | the time unit to use if it should not be chosen automatically |
nvals_num_to_cat | numeric variables with this many or fewer distinct values will be treated as categorical |
outdir | an optional output directory to save the resulting plots as png images |
library(xray)
data(longley)
longley$Year=as.Date(paste0(longley$Year,'-01-01'))
timebased(longley, 'Year')
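The effect of `nvals_num_to_cat` can be pictured with a small base R check. This is only a sketch of the rule as the argument description states it: the helper name `treat_as_categorical` is hypothetical, and ignoring NA values when counting distinct values is an assumption rather than documented behavior.

```r
# Hypothetical helper (not part of xray): TRUE for numeric columns with few
# enough distinct values to be handled as categories instead of as a
# continuous measurement.
treat_as_categorical <- function(x, nvals_num_to_cat = 2) {
  # Assumption: NA is not counted as a distinct value.
  is.numeric(x) && length(unique(x[!is.na(x)])) <= nvals_num_to_cat
}

# With the default of 2, only two-valued numeric columns qualify.
flags_default <- vapply(mtcars, treat_as_categorical, logical(1))
names(flags_default)[flags_default]

# Raising the limit to 3 also picks up columns with three distinct values.
flags_three <- vapply(mtcars, treat_as_categorical, logical(1),
                      nvals_num_to_cat = 3)
names(flags_three)[flags_three]
```

In `mtcars`, the default picks out `vs` and `am`, and a limit of 3 adds `cyl` and `gear`.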