V3: Tools for task views

Patrice Kiener

2019-08-19

Introduction

RWsearch stands for « Search in R packages, task views, CRAN and in the Web ».

RWsearch is a management tool that serves several purposes:

  1. Provide a simple instruction to read and evaluate non-standard content (non-standard evaluation).
  2. Download the list of all packages and task views available on CRAN at a given date and rearrange them in a convenient format.
  3. List the packages that has been added, updated and removed between two dates.
  4. Search packages that match one or several keywords (with mode or, and, relax) in any column of the data.frame and by default in columns Package name, Title, Description, Author and Maintainer (can be selected individually).
  5. Arrange the results in different format (list, tables with different columns) and display them in console or in txt, md, html, pdf files.
  6. In one instruction, download on your disk the whole documentation related to one or several packages. This is the perfect tool to read the documentation off-line.
  7. Provide tools for task view maintenance: Detect the packages recently added to or updated in CRAN, check if they match some keywords, check if they are already recorded in a given task views.
  8. Provide quick links to launch web search engines and search for keywords from the R console. This list of web search engines is expected to grow (with your help).

Compare to other packages (packagefinder, websearchr) or web services (RDocumentation, rdrr) with similar objectives, the search options of RWsearch are more sophisticated and allow for a finer search. RWsearch also addresses a much larger number of web search engines. Non-standard evaluation (or evaluation of non-standard content) makes it user friendly.

This vignette deals with point 7. Points 1 to 6 and point 8 are discussed in other vignettes.

Task views

Task views are an important part of CRAN as they help R users to find the packages that match their needs. Task view pages are maintained by volunteers who update the pages and upload them to CRAN with the package ctv.

RWsearch focuses on detecting the (old and new) packages that could be added to the task views.

View, download, load and explore task views

As a preliminary, all task views can be read on-line with the instruction:

h_crantv()
# Open the html page of CRAN task views in the browser

RWsearch provides the functions tvdb_down() and tvdb_load() to download a file renamed or load the task views in .GlobalEnv as a list named tvdb. Four functions are available to explore this list: tv_vec(), tvdb_dfr(), tvdb_list(), tvdb_pkgs().

crandb_down(), tvdb_down(), crandb_load(), tvdb_load()

In an empty directory, the two files crandb.rda and tvdb.rda, which are cleaned versions of the files /web/packages/packages.rds and /src/contrib/Views.rds, must be downloaded from CRAN. Once downloaded, the files are automatically loaded with names crandb (a data.frame) and tvdb (a list) in .GlobalEnv. The most recently updated task view can have an older date than the saved date.

crandb_down()
# $newfile
# [1] "crandb.rda saved and loaded. 13752 packages listed between 2006-03-15 and 2019-02-25"

tvdb_down()
# tvdb.rda saved. tvdb loaded. 39 task views listed between 2015-01-07 and 2019-02-25

ls()
# [1] "crandb" "tvdb"  

If the files already exist in one specific directory, they can be loaded:

crandb_load(filename = "crandb-2019-0226.rda") 
# [1] "crandb.rda saved and loaded. 13752 packages listed between 2006-03-15 and 2019-02-25"

tvdb_load(filename = "tvdb-2019-0226.rda") 
# tvdb loaded. 39 task views listed between 2015-01-07 and 2019-02-25

tvdb_vec(), tvdb_dfr()

Functions tv_vec() and tvdb_dfr() list the task views available and their topics.

There are 39 task views in February 2019. The largest task view is Time Series with 273 referred packages. 31 task views have been updated in the last 100 days and 37 task views in the last 12 months.

tvdb_vec() 

#  [1] "Bayesian"                  "ChemPhys"                  "ClinicalTrials"           
#  [4] "Cluster"                   "Databases"                 "DifferentialEquations"    
#  [7] "Distributions"             "Econometrics"              "Environmetrics"           
# [10] "ExperimentalDesign"        "ExtremeValue"              "Finance"                  
# [13] "FunctionalData"            "Genetics"                  "gR"                       
# [16] "Graphics"                  "HighPerformanceComputing"  "Hydrology"                
# [19] "MachineLearning"           "MedicalImaging"            "MetaAnalysis"             
# [22] "MissingData"               "ModelDeployment"           "Multivariate"             
# [25] "NaturalLanguageProcessing" "NumericalMathematics"      "OfficialStatistics"       
# [28] "Optimization"              "Pharmacokinetics"          "Phylogenetics"            
# [31] "Psychometrics"             "ReproducibleResearch"      "Robust"                   
# [34] "SocialSciences"            "Spatial"                   "SpatioTemporal"           
# [37] "Survival"                  "TimeSeries"                "WebTechnologies"          
tvdb_dfr()

#    version    npkgs name                      topic                                                                
# 1  2019-01-29 131   Bayesian                  Bayesian Inference                                                   
# 2  2019-02-05  91   ChemPhys                  Chemometrics and Computational Physics                               
# 3  2019-02-20  51   ClinicalTrials            Clinical Trial Design, Monitoring, and Analysis                      
# 4  2019-01-14 111   Cluster                   Cluster Analysis & Finite Mixture Models                             
# 5  2019-02-04  38   Databases                 Databases with R                                                     
# 6  2019-01-26  29   DifferentialEquations     Differential Equations                                               
# 7  2019-01-26 208   Distributions             Probability Distributions                                            
# 8  2019-01-26 139   Econometrics              Econometrics                                                         
# 9  2019-01-17 104   Environmetrics            Analysis of Ecological and Environmental Data                        
# 10 2018-05-26 124   ExperimentalDesign        Design of Experiments (DoE) & Analysis of Experimental Data          
# 11 2018-12-03  24   ExtremeValue              Extreme Value Analysis                                               
# 12 2019-01-26 154   Finance                   Empirical Finance                                                    
# 13 2019-01-26  40   FunctionalData            Functional Data Analysis                                             
# 14 2019-01-26  26   Genetics                  Statistical Genetics                                                 
# 15 2019-01-29  35   gR                        gRaphical Models in R                                                
# 16 2015-01-07  41   Graphics                  Graphic Displays & Dynamic Graphics & Graphic Devices & Visualization
# 17 2018-10-30  97   HighPerformanceComputing  High-Performance and Parallel Computing with R                       
# 18 2019-01-24  86   Hydrology                 Hydrological Data and Modeling                                       
# 19 2019-02-19 102   MachineLearning           Machine Learning & Statistical Learning                              
# 20 2018-11-30  32   MedicalImaging            Medical Image Analysis                                               
# 21 2019-01-26 104   MetaAnalysis              Meta-Analysis                                                        
# 22 2019-02-25 142   MissingData               Missing Data                                                         
# 23 2019-02-04  31   ModelDeployment           Model Deployment with R                                              
# 24 2018-07-21 121   Multivariate              Multivariate Statistics                                              
# 25 2017-11-29  55   NaturalLanguageProcessing Natural Language Processing                                          
# 26 2019-02-17  96   NumericalMathematics      Numerical Mathematics                                                
# 27 2019-01-23 123   OfficialStatistics        Official Statistics & Survey Methodology                             
# 28 2019-02-17 127   Optimization              Optimization and Mathematical Programming                            
# 29 2018-06-11  12   Pharmacokinetics          Analysis of Pharmacokinetic Data                                     
# 30 2019-02-14  94   Phylogenetics             Phylogenetics, Especially Comparative Methods                        
# 31 2019-02-11 209   Psychometrics             Psychometric Models and Methods                                      
# 32 2019-01-26  59   ReproducibleResearch      Reproducible Research                                                
# 33 2018-06-18  60   Robust                    Robust Statistical Methods                                           
# 34 2018-06-18  82   SocialSciences            Statistics for the Social Sciences                                   
# 35 2019-02-25 194   Spatial                   Analysis of Spatial Data                                             
# 36 2019-02-05  65   SpatioTemporal            Handling and Analyzing Spatio-Temporal Data                          
# 37 2019-01-26 264   Survival                  Survival Analysis                                                    
# 38 2019-02-12 273   TimeSeries                Time Series Analysis                                                 
# 39 2019-02-25 247   WebTechnologies           Web Technologies and Services                                        

tvdb_list(), tvdb_pkgs()

tv_list() lists all task view packages. tvdb_dfr() lists the packages of the task views selected by the user.

tvdb_list()

# $Bayesian
# [1]   "abc"      "abn"      "AdMit"    "arm"      "AtelieR"  
# ...
# [127] "spTimer"  "ssgraph"  "stochvol" "tgp"      "zic"     
# ...
# ...
# $WebTechnologies
# [1]   "abbyyR"    "ajv"       "analogsea" "aRxiv"     "aws.polly" 
# ...
# [243] "xslt"      "yhatr"     "yummlyr"   "zendeskR"  "ZillowR"  
tvdb_pkgs(ChemPhys, Distributions, Robust)

# $ChemPhys
#  [1] "ALS"               "AnalyzeFMRI"       "AquaEnv"           "astro"
# ...
# [89] "varSelRF"          "webchem"           "WilcoxCV"
#
# $Distributions
#   [1] "actuar"           "AdMit"             "agricolae"         "ald"
# ...
# [205] "visualize"        "Wrapped"           "zipfextR"          "zipfR"
#
# $Robust
#  [1] "cluster"        "complmrob"      "covRobust"      "coxrobust"      "distr"
# ...
# [56] "ssmrob"         "tclust"         "TEEReg"         "walrus"         "WRS2"

Counting the number of referred packages

The largest task view has referred 273 packages. All task views have referred 4021 packages but only 3221 unique packages. This is 23.4 % of the total number of packages currently in CRAN. The task is immense if we want to refer all packages.

nrow(crandb)
# 13752 

max(sapply(tvdb_list(), length))
# [1] 273

length(unlist(tvdb_list()))
# [1] 4021

length(unique(unlist(tvdb_list())))
# [1] 3221

Task view maintenance

RWsearch focuses on detecting the (old and new) packages that could be added to the task views.

Function s_crandb_tvdb() combines many functions of RWsearch. It searchs packages in crandb by keywords and checks if the found packages are referred in the selected task view and installed in the computer (in the /site-library directory).

The example presented below refers to the Distributions task view.

s_crandb_tvdb()

The output of s_crandb_tvdb() is a list with the following items:

keywords <- cnsc(probability, distribution)
(lst <- s_crandb_tvdb(char = keywords, tv = "Distributions", 
                      from = -15, to = "2019-02-25", select = "PT"))

# $spkgs
# [1] "distreg.vis" "GB2group"    "ipcwswitch"  "MixGHD"      "MSPRT"       "pgdraw"     
# [7] "philentropy" "triangle"   

# $inTV
# [1] "triangle"

# $notinTV
# [1] "distreg.vis" "GB2group"    "ipcwswitch"  "MixGHD"      "MSPRT"       "pgdraw"     
# [7] "philentropy"

# $inTV_in
# character(0)

# $inTV_un
# [1] "triangle"

# $notinTV_in
# character(0)

# $notinTV_un
# [1] "distreg.vis" "GB2group"    "ipcwswitch"  "MixGHD"      "MSPRT"       "pgdraw"     
# [7] "philentropy"   

Visualize the unreferred packages

A first assesment of the packages that match the keywords can be done with p_table2(). Consider also p_display5() to read the results in the browser.

p_table2(), p_display5()

p_table2(lst$notinTV)

#      Package     Title                                                                                                                     
# 2766 distreg.vis Framework for the Visualization of Distributional Regression Models                                                       
# 4209 GB2group    Estimation of the Generalised Beta Distribution of the Second Kind from Grouped Data                                      
# 5592 ipcwswitch  Inverse Probability of Censoring Weights to Deal with Treatment Switch in Randomized Clinical Trials                      
# 7072 MixGHD      Model Based Clustering, Classification and Discriminant Analysis Using the Mixture of Generalized Hyperbolic Distributions
# 7404 MSPRT       Modified Sequential Probability Ratio Test (MSPRT)                                                                        
# 8613 pgdraw      Generate Random Samples from the Polya-Gamma Distribution                                                                 
# 8659 philentropy Similarity and Distance Quantification Between Probability Functions                                                      
p_display5(lst$notinTV)
## read the html page launched in your browser

From this list and the DESCRIPTION in the html page, packages GB2group, MixGHD, pgdraw and philentropy deserve further exploration.

p_page(), p_pdfweb()

4 packages is a reasonnable number. We can open the CRAN/index.html pages and CRAN/manual.pdf files in the browser. Alternatively, consider downloading the documentation of these packages.

pkgs <- cnsc(GB2group, MixGHD, pgdraw, philentropy)

p_page(char = pkgs)
## read the 4 html pages launched in your browser

p_pdfweb(char = pkgs)
## read the 4 pdf pages launched in your browser

p_down(char = pkgs)
## read the files saved in the current directory

The pdf files and the vignettes give all expected information:

For these packages, the quality of the d,p,q,r functions should be explored (but has not been yet).