Research in Statistical Sciences Interactive Visualizations

The JavaScript libraries used for the visualizations are dc.js, which has native crossfilter support for exploring multidimensional datasets; d3.js, a highly popular JavaScript library for data visualization; and crossfilter.js, for fast multidimensional filtering across coordinated views.

The size and scope of the literature on statistics can be overwhelming, which makes it difficult to identify emerging trends and see the relationships between different developments. Visualization techniques, coupled with statistical and data mining methods, have been found effective in achieving these goals in a number of application domains, including healthcare and manufacturing research. In this paper, we apply these concepts to the field of statistical sciences. Our dataset is based on bibliographic information, including but not limited to authors, keywords, abstracts, citations, and funding information, extracted from 8,191 papers published in the 17 journals of the American Statistical Association (ASA) between 1991 and 2014. These bibliographic units of analysis allow us to address the following questions: a) What are the main research fields within statistics (based on a data-driven approach)? b) How do these research fields relate to each other? c) How did these fields develop from 1991 to 2014? d) What are the main drivers for these publications? Our results indicate that bibliometric visualization approaches can provide insights into the massive amount of literature published and cited by ASA papers over the past twenty-four years.

We begin with a very high-level view of the data, focusing on the important dimensions of this multidimensional dataset. The visualizations below emphasize exploration as well as interactivity. The narration accompanying each visualization summarizes the important information obtained from it.

Journal Names (High Level Exploration)

The chart below shows how the number of papers and the number of times they are cited vary across the journals in which the papers were published. This high-level exploration gives an overview of the various areas of statistical research. An important feature is the ability to filter the dataset by journal name: clicking any of the 9 bubbles below restricts the data to the corresponding journals. Most importantly, the bubble chart shows the popularity of each journal's papers, with total times cited on the x axis and times cited on the y axis, and the number of papers in the journal as the size of the bubble. The two citation measures differ because times cited counts the papers that cited the papers in the given journal, whereas total times cited also includes multiple citations within those citing papers. The main aim of this chart is to facilitate interactivity and data filtering rather than detailed visual exploration. Still, it offers some important insights. The left-to-right order of the bubbles shows how the number of citations and the number of statistical research papers vary across fields of research. Papers published in JASA (Journal of the American Statistical Association) are cited more, while papers in field-specific journals such as Stat. Biopharm. Res. (Statistics in Biopharmaceutical Research) are cited comparatively less. This reflects the popularity of papers published in JASA. It also suggests that papers published in field-specific journals may not be widely used for statistical research in other fields of study.
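The aggregation behind a bubble chart of this kind can be sketched as follows. This is a minimal, library-free illustration and not the page's actual dc.js/crossfilter code: the record shape, the field names, and the sample values are all assumptions made for the example.

```typescript
// Hypothetical record shape: these field names are assumptions for illustration,
// not the actual column names of the ASA dataset.
interface Paper {
  journal: string;
  timesCited: number;      // number of papers citing this paper
  totalTimesCited: number; // also counts repeated citations within a citing paper
}

// One bubble per journal.
interface JournalBubble {
  journal: string;
  paperCount: number;      // drives the bubble size
  timesCited: number;      // y axis
  totalTimesCited: number; // x axis
}

// Group papers by journal and sum the three quantities the chart encodes.
function bubblesByJournal(papers: Paper[]): JournalBubble[] {
  const byJournal: Record<string, JournalBubble> = {};
  const bubbles: JournalBubble[] = [];
  for (const p of papers) {
    let bubble = byJournal[p.journal];
    if (bubble === undefined) {
      bubble = { journal: p.journal, paperCount: 0, timesCited: 0, totalTimesCited: 0 };
      byJournal[p.journal] = bubble;
      bubbles.push(bubble);
    }
    bubble.paperCount += 1;
    bubble.timesCited += p.timesCited;
    bubble.totalTimesCited += p.totalTimesCited;
  }
  // Order left to right by the x-axis value.
  return bubbles.sort((a, b) => a.totalTimesCited - b.totalTimesCited);
}

// Clicking bubbles selects journals; an empty selection means "no filter".
function filterByJournals(papers: Paper[], selected: string[]): Paper[] {
  if (selected.length === 0) {
    return papers;
  }
  return papers.filter((p) => selected.indexOf(p.journal) !== -1);
}

// Made-up sample values, only to exercise the functions.
const samplePapers: Paper[] = [
  { journal: "JASA", timesCited: 40, totalTimesCited: 55 },
  { journal: "JASA", timesCited: 10, totalTimesCited: 12 },
  { journal: "Stat. Biopharm. Res.", timesCited: 3, totalTimesCited: 4 },
];
const sampleBubbles = bubblesByJournal(samplePapers);
console.log(sampleBubbles);
```

In a coordinated-view setup, the output of `filterByJournals` would feed every other chart, which is what makes the bubble chart useful as a filter control rather than as a detailed display.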

Time series visualizations

The visualizations below explore the dataset as panel data, looking at variation over the years. The first is a high-level composite time series chart showing how statistical research in specific research fields has varied over the years. Note that a dominating group can overshadow the variation in other groups, and this dataset is no exception: the papers under the general label "Statistics and Probability" outnumber the papers in all other fields. All JASA papers carry the research field label "Statistics and Probability". Hence, to see the variation in other research fields, the user should remove these general papers from the mix. To facilitate this filtering, we have provided an extra donut chart below. Clicking on every segment other than JASA in the donut applies the filter.
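The effect of removing a dominating group before counting can be sketched as below. This is an illustrative, library-free sketch under assumed field names and made-up sample records (including the placeholder "Other Journal"), not the code that drives the charts on this page.

```typescript
// Hypothetical record shape; field names are assumptions for illustration.
interface FieldPaper {
  year: number;
  journal: string;
  field: string;
}

// Count papers per (year, field), skipping papers from excluded journals.
// Counts are keyed as "year|field".
function countByYearAndField(
  papers: FieldPaper[],
  excludedJournals: string[]
): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const p of papers) {
    if (excludedJournals.indexOf(p.journal) !== -1) {
      continue; // e.g. drop JASA so the general label no longer dominates
    }
    const key = p.year + "|" + p.field;
    counts[key] = (counts[key] || 0) + 1;
  }
  return counts;
}

// Made-up sample records; "Other Journal" is a placeholder name.
const fieldSample: FieldPaper[] = [
  { year: 2010, journal: "JASA", field: "Statistics and Probability" },
  { year: 2010, journal: "JASA", field: "Statistics and Probability" },
  { year: 2010, journal: "Other Journal", field: "Computer Science" },
  { year: 2011, journal: "Other Journal", field: "Economics" },
];

const allCounts = countByYearAndField(fieldSample, []);
const withoutJasa = countByYearAndField(fieldSample, ["JASA"]);
console.log(allCounts, withoutJasa);
```

With the dominant journal excluded, the remaining series share a much smaller value range, so their year-to-year variation becomes visible on the same axis.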

Another important feature of these visualizations is the ability to choose which measure is displayed. The two buttons below, "Number of papers" and "Number of citations", do what their titles say: clicking the first switches all the time series charts below to the number of papers, and clicking the second switches them to the number of citations. The "Reset all the viz" option below these buttons resets all the visualizations and clears all filtering. Finally, a table gives more detail about the papers. The table is also filtered when other filters are applied.
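Switching a time series between the two measures amounts to changing what each paper contributes to its year's total. The sketch below illustrates this under assumptions: the type and field names are hypothetical, citations are attributed to the publication year of the cited paper, and the sample values are made up.

```typescript
// The two measures offered by the buttons.
type Metric = "papers" | "citations";

// Hypothetical record shape; field names are assumptions for illustration.
interface CitedPaper {
  year: number;       // publication year
  timesCited: number; // citations received by this paper
}

interface SeriesPoint {
  year: number;
  value: number;
}

// Build one yearly series; the metric decides each paper's contribution.
function seriesByYear(papers: CitedPaper[], metric: Metric): SeriesPoint[] {
  const totals: Record<number, number> = {};
  for (const p of papers) {
    const increment = metric === "papers" ? 1 : p.timesCited;
    totals[p.year] = (totals[p.year] || 0) + increment;
  }
  return Object.keys(totals)
    .map((y) => Number(y))
    .sort((a, b) => a - b)
    .map((year) => ({ year: year, value: totals[year] || 0 }));
}

// Made-up sample values.
const citedSample: CitedPaper[] = [
  { year: 1995, timesCited: 120 },
  { year: 1995, timesCited: 30 },
  { year: 2010, timesCited: 5 },
];

const paperSeries = seriesByYear(citedSample, "papers");
const citationSeries = seriesByYear(citedSample, "citations");
console.log(paperSeries, citationSeries);
```

Because both series come from the same filtered records, a single toggle can redraw every chart consistently without touching the active filters.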

Several important observations can be made from the first time series. While the number of papers in fields such as Economics and Social Sciences stays almost constant from 1991 to 2014, Computer Science papers increase in the period starting in 2010. This can most probably be attributed to the "Data revolution" that began around that time. Papers in Statistics and Probability show a gradual increase from 2000 onwards compared to earlier years. The beginning of the Data revolution and the growing use of statistical methods for Big Data analysis may have been the reason for this increase in research. Computational power has also increased many times over since the 2000s: unlike in the 1990s, researchers can now use their personal computers for statistical research. This improvement in computing technology may also have contributed to the increase in papers in Statistics and Probability.

Looking at the citations, further observations can be made. The number of citations for Economics-related papers has declined gradually since 2003. Surprisingly enough, a similar decrease in Economics papers appears in separate research by Cardoso et al. in 2010 (Trends in Economic Research: An International Perspective). Although the number of statistical papers in the Economics field remained almost constant, the number of citations has decreased. We see a similar trend in Social Sciences. However, the decrease in citations for Statistics and Probability papers (mostly JASA papers) from 2002 onwards shows that even statistical research in the 2010s uses techniques from the 1990s. This means that since the 1990s no breakthrough techniques or research have emerged in Statistics that supersede the techniques developed in the 1990s. It should be remembered, however, that the number of papers in Statistics and Probability has been increasing since the 2000s. From all this, we conclude that the statistics research community in the USA has focused on applying and improving techniques developed earlier. Although this may not look like a progressive path, the increase in research is clearly a sign of improvement. Considering all these facts, it is reasonable to expect that the research conducted in the 2000s may be used in later decades.

Many other interesting observations can be made using the selection and filtering options available in these visualizations. More time series visualizations are given below, looking at the occurrence of specific keywords in the research papers. These visualizations are based on the 10 clusters defined in the main article, and further observations are given under each of them.


Time Series Cluster 1 - Reliability/Survival Analysis

The time series chart below shows how the number of papers and the number of times they are cited vary across years. The cluster is named Reliability/Survival Analysis because it contains the keyword "Survival Analysis". Its other keywords can be found in many other fields of study. To switch between number of papers and citations, use the same buttons above.
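A keyword time series of this kind can be derived by counting, for each year, the papers that list a given keyword. The sketch below is illustrative only: the record shape is hypothetical, the case-insensitive matching is an assumption about how keyword variants might be reconciled, and the sample records are made up.

```typescript
// Hypothetical record shape; field names are assumptions for illustration.
interface KeywordPaper {
  year: number;
  keywords: string[];
}

// Assumed normalization: ignore case and surrounding whitespace.
function normalizeKeyword(keyword: string): string {
  return keyword.trim().toLowerCase();
}

// Count papers per year that list the keyword; a paper counts at most once.
function keywordSeries(papers: KeywordPaper[], keyword: string): Record<number, number> {
  const target = normalizeKeyword(keyword);
  const counts: Record<number, number> = {};
  for (const p of papers) {
    const hasKeyword = p.keywords.some((k) => normalizeKeyword(k) === target);
    if (hasKeyword) {
      counts[p.year] = (counts[p.year] || 0) + 1;
    }
  }
  return counts;
}

// Made-up sample records.
const keywordSample: KeywordPaper[] = [
  { year: 1995, keywords: ["Bootstrap", "Measurement Error"] },
  { year: 1995, keywords: ["bootstrap "] },
  { year: 2004, keywords: ["Survival Analysis"] },
];

const bootstrapSeries = keywordSeries(keywordSample, "Bootstrap");
console.log(bootstrapSeries);
```

Running the same function once per keyword in a cluster yields one line per keyword for the chart.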

Looking at the number of papers alone, we can see that "Bootstrap" is a keyword used prominently in statistical research from 1991 to 2014. This is a strong indication that random resampling methods will continue to be used consistently as Statistics evolves, since resampling is the backbone of all statistical analysis. "Measurement Error" also appears consistently in the literature throughout the years. This too is intuitive, since deviation from the true value is inherent in any kind of analysis.

When we look at the number of citations, we can see that the statistical community tends to cite papers from the 1990s with keywords such as "Bootstrap" and "Measurement Error" more often. This shows that researchers still refer to seminal papers from the 1990s with these keywords. These keywords come from well-established areas of statistical analysis. Hence the older papers are still cited, even though these techniques are also used in recent research.

Time Series Cluster 3 - Statistics in Medicine

The time series chart below shows how the number of papers and the number of times they are cited vary. To switch between number of papers and citations, use the same buttons above.

We can see some really interesting trends in the usage of certain keywords in this cluster. The most interesting is the keyword "R". It represents the R software, which remains the most popular tool for statistical analysis even 22 years after its inception. The explosion of papers using R for analysis, starting from 2004, shows how a programming language can also influence research. This strongly agrees with the view that "Data Science" is an optimal combination of Statistics, Programming and Domain knowledge. It also correlates with the fact that the first stable version of R came out in the 2000s and an easy data loading feature was added in 2004. After this, R spread widely within the statistical community, aided by the internet.

Another important trend is the upward movement in the usage of the keyword "Dimension Reduction" from the 2000s. Variable selection and dimensionality reduction are terms associated with Machine learning, and the explosion in Machine learning research from the 2000s until now may be the reason behind this. Finally, "Visualization" was intentionally included in this cluster to show an interesting trend: it has been trending upward from 2008 onwards. The rise during this period of JavaScript libraries such as d3.js for interactive web visualizations and exploratory data analysis may have contributed to this.

When looking at citations, we see that almost all the citations are from the 1990s. But "R" stands out here too.

Time Series Cluster 4 - Model and Variable Selection

The time series chart below shows how the number of papers and the number of times they are cited vary across years. To switch between number of papers and citations, use the same buttons above.

This cluster contains keywords related to Model and Variable Selection. Important keywords such as Generalized Linear Model and Logistic Regression represent well-established techniques in modeling and variable selection. From the interactive time series visualization below, which shows the number of papers with each of these keywords, we can observe that these methods and techniques have been used consistently from 1991 until now. The number of citations follows a trend similar to the one seen in the visualizations above: papers from the 1990s containing these keywords are cited more than those published after 2000. The spike in "Importance Sampling" in 1995 is worth noting. It can be attributed to the paper titled "BAYES FACTORS" by Robert E. Kass and Adrian E. Raftery, published in 1995, which is considered a seminal paper in Bayesian hypothesis testing.

Time Series Cluster 5 - Data Quality

This cluster contains some very important keywords connected to Data Quality, such as Markov Chain Monte Carlo (MCMC) and EM Algorithm. Looking at the number of papers over the years, the increase in papers with MCMC as a keyword is worth noting. This rise can be attributed to the recent increase in the use of MCMC methods for hierarchical models in Bayesian Statistics. Looking at the number of citations, we can see that seminal papers from the 1990s as well as the 2000s are cited the most for these important keywords. The spike in citations for MCMC in 1995 can again be attributed to the single seminal paper "BAYES FACTORS" by Robert E. Kass and Adrian E. Raftery mentioned earlier. However, papers from the 2000s with the keyword "Markov Chain Monte Carlo" have also been cited consistently more, up to 2010. The use of MCMC methods for hierarchical modelling mentioned earlier may be one reason for this. In addition, the number of citations for papers with the keyword "Hierarchical Model" is higher from 2005 to 2010, which falls within the same period.

Mapping Research in Statistical Sciences

A Visual Exploration of 8,191 Publications (1991-2014) in the Journals of the American Statistical Association and their 119,459 Citations
