Data for Research (DfR) by JSTOR is a valuable source for citation analysis and text mining. The jstor package provides functions and suggests workflows for importing datasets from DfR.

When using DfR, requests for datasets can be made for small excerpts (max. 25,000 records) or large ones, which require an agreement between the researcher and JSTOR. jstor was developed to deal with very large datasets which require an agreement, but can be used with smaller ones as well.

I will demonstrate the package's functions using the sample dataset that JSTOR provides on its website.

General Concept

All functions from the jst_get_* family which are concerned with meta data operate along the same lines:

The file is read with xml2::read_xml().
Content of the file is extracted via XPATH or CSS-expressions.
The resulting data is returned in a tidy tibble.
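
The three steps above can be illustrated with a minimal sketch. The XML snippet, the helper extract_first() and the XPath expressions are simplified assumptions for illustration; they are not the package's internal code.

```r
library(xml2)
library(tibble)

# A tiny, made-up XML document standing in for a DfR metadata file.
xml_string <- paste0(
  "<article><front><article-meta>",
  "<title-group><article-title>On the Protozoa Parasitic in Frogs</article-title></title-group>",
  "<volume>41</volume>",
  "</article-meta></front></article>"
)

# Step 1: read the file (here: a string) with xml2::read_xml().
doc <- read_xml(xml_string)

# Step 2: extract content via XPath. Missing nodes yield NA.
# extract_first() is a hypothetical helper, not part of jstor.
extract_first <- function(doc, xpath) {
  xml_text(xml_find_first(doc, xpath))
}

# Step 3: return the result as a tidy tibble with one row.
sketch <- tibble(
  article_title = extract_first(doc, ".//article-title"),
  volume = extract_first(doc, ".//volume"),
  issue = extract_first(doc, ".//issue")
)
sketch
```

Elements that are absent from the file (issue in this sketch) end up as NA rather than raising an error, which mirrors the missing values seen in the outputs further below.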

The functions are similar in that all operate on single files (article, book, research report or pamphlet). Depending on the content of the file, the output of the functions might have one or multiple rows. jst_get_article always returns a tibble with one row: the core meta data (like title, id, or first page of the article) are single items, and only one article is processed at a time. Running jst_get_authors for the same article might give you a tibble with one or multiple rows, depending on the number of authors the article has. The same is true for jst_get_references and jst_get_footnotes. If a file has no data on references (they might still exist, but JSTOR might not have parsed them), the output is only one row, with missing references. If there is data on references, each entry gets its own row. Note however, that the number of rows does not equal the number of references. References usually start with a title like “References”, which is obviously not a reference to another article. Be sure to think carefully about your assumptions and to check the content of your data before you make inferences.
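
A quick way to check these row counts for a given file is to call each function on the same path and compare the results. The following is a minimal sketch using an example file shipped with the package; the object names path and n_rows are only illustrative.

```r
library(jstor)

path <- jst_example("article_with_references.xml")

# Number of rows returned by each function for the same file.
n_rows <- c(
  article    = nrow(jst_get_article(path)),
  authors    = nrow(jst_get_authors(path)),
  references = nrow(jst_get_references(path)),
  footnotes  = nrow(jst_get_footnotes(path))
)
n_rows
```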

Books work a bit differently. Searching for data on https://www.jstor.org/dfr/results lets you filter for books, which are actually book chapters. If you receive data from DfR on a book chapter, you always get one xml-file with the whole book, including data on all chapters. Ngram or full-text data for the same entry however is processed only from single chapters. Thus, the output of jst_get_book for a single file is similar to the one from jst_get_article: it is one row with general data about the book. jst_get_chapters gives you data on all chapters, and the resulting tibble therefore might have multiple rows.

The following sections showcase the different functions separately.

Application

Apart from jstor we only need to load dplyr for matching records and knitr for printing nice tables.

library(jstor)
library(dplyr)
library(knitr)

jst_get_article

The basic usage of the jst_get_* functions is very simple. They take only one argument, the path to the file to import:

meta_data <- jst_get_article(file_path = jst_example("article_with_references.xml"))

The resulting object is a tibble with one row and 19 columns. The columns correspond to most of the elements documented here: https://www.jstor.org/dfr/about/technical-specifications.

The columns are:

file_name (chr): The file name of the original .xml-file. Can be used for joining with other parts (authors, references, footnotes, full-texts).
journal_doi (chr): A registered identifier for the journal.
journal_jcode (chr): An identifier for the journal like “amerjsoci” for the “American Journal of Sociology”.
journal_pub_id (chr): Similar to journal_jcode. Most of the time either one is present.
journal_title (chr): The title of the journal.
article_doi (chr): A registered unique identifier for the article.
article_jcode (chr): A unique identifier for the article (not a DOI).
article_pub_id (chr): Infrequent, either part of the DOI or the article_jcode.
article_type (chr): The type of article (research-article, book-review, etc.).
article_title (chr): The title of the article.
volume (chr): The volume the article was published in.
issue (chr): The issue the article was published in.
language (chr): The language of the article.
pub_day (chr): Publication day, if specified.
pub_month (chr): Publication month, if specified.
pub_year (int): Year of publication.
first_page (int): Page number for the first page of the article.
last_page (int): Page number for the last page of the article.
page_range (chr): The range of pages for the article.

Since all functions return tibbles, the result is nicely formatted:

meta_data %>% kable()

file_name	journal_doi	journal_jcode	journal_pub_id	journal_title	article_doi	article_pub_id	article_jcode	article_type	article_title	volume	issue	language	pub_day	pub_month	pub_year	first_page	last_page	page_range
article_with_references	NA	tranamermicrsoci	NA	Transactions of the American Microscopical Society	10.2307/3221896	NA	NA	research-article	On the Protozoa Parasitic in Frogs	41	2	eng	1	4	1922	59	76	59-76
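
Because jst_get_article() processes one file at a time, importing several files means calling it repeatedly and binding the results. Here is a minimal sketch with base lapply() and dplyr::bind_rows(), using two of the example files from the package; paths and all_meta are illustrative names, and this is only one of several ways to loop over files.

```r
library(jstor)
library(dplyr)

# Two example files that ship with the package.
paths <- c(
  jst_example("article_with_references.xml"),
  jst_example("article_with_footnotes.xml")
)

# One call per file, then stack the one-row tibbles.
all_meta <- lapply(paths, jst_get_article) %>%
  bind_rows()

all_meta %>%
  select(file_name, article_title, pub_year)
```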

jst_get_authors

Extracting the authors works in similar fashion:

authors <- jst_get_authors(jst_example("article_with_references.xml"))
kable(authors)

file_name	prefix	given_name	surname	string_name	suffix	author_number
article_with_references	NA	R.	Kudo	NA	NA	1

Here we have the following columns:

file_name: The same as above, used for matching articles.
prefix: A prefix to the name.
given_name: The given name of the author (e.g. Albert or A.).
surname: The surname of the author (e.g. Einstein).
string_name: Sometimes instead of given_name and surname, only a full string is supplied, e.g. Albert Einstein, or Einstein, Albert.
suffix: A suffix to the name, as in Albert Einstein, II.
author_number: An integer representing the order in which the authors appeared in the data.

The number of rows matches the number of authors: each author gets their own row.
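
Since a name can arrive either as given_name plus surname or as a single string_name, it can be handy to collapse both variants into one display column. The following is a sketch of one possible approach; the column full_name and the fallback order are my own assumptions, not something the package provides.

```r
library(jstor)
library(dplyr)

authors <- jst_get_authors(jst_example("article_with_references.xml"))

authors_named <- authors %>%
  mutate(
    full_name = case_when(
      # Prefer the full string if JSTOR supplied one.
      !is.na(string_name) ~ as.character(string_name),
      # Otherwise combine given name and surname when both are present.
      !is.na(given_name) & !is.na(surname) ~ paste(given_name, surname),
      # Fall back to the surname alone (may still be NA).
      TRUE ~ as.character(surname)
    )
  )

authors_named %>%
  select(file_name, author_number, full_name)
```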

jst_get_references

references <- jst_get_references(jst_example("article_with_references.xml"))

# we need to remove line breaks for knitr::kable() to work properly for printing
references <- references %>%
  mutate(ref_unparsed = stringr::str_remove_all(ref_unparsed, "\\n"))

The output has the following columns:

file_name: Identifier, can be used for matching.
ref_title: The title of the references sections.
ref_authors: A string of authors. Several authors are separated with ;.
ref_editors: A string of editors, if present.
ref_collab: A field that may contain information on the authors, if authors are not available.
ref_item_title: The title of the cited entry.
ref_year: A year, often the article’s publication year, but not always.
ref_source: The source of the cited entry. For books often the title of the book, for articles the publisher of the journal.
ref_volume: The volume of the journal article.
ref_first_page: The first page of the article/chapter.
ref_last_page: The last page of the article/chapter.
ref_publisher: For books the publisher, for articles often missing.
ref_publication_type: Known types: book, journal, web, other.
ref_unparsed: The full references entry in unparsed form.

Here I display 5 random entries:

references %>% 
  sample_n(5) %>% 
  kable()

file_name	ref_title	ref_authors	ref_collab	ref_item_title	ref_year	ref_source	ref_volume	ref_first_page	ref_last_page	ref_publisher	ref_publication_type	ref_unparsed
article_with_references	References: Trypanosomes	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	LEBEDEFF, A.1910 Ueber Trypanosoma rotatorium Gruby. Festschr. 60sten Geburts. RichardHertwigs, 1:397-436, 2 pl., 9 textfig.
article_with_references	References: Opalinae	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	METCALF, M. M.1909 Opalina. Its anatomy and reproduction, with a description of infection experi-ments and a chronological review of the literature. Arch. Protist., 13. 181pp., 15 pl. and 15 textfig.
article_with_references	References: Leptotheca ohilmacheri	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	1922 On the morphology and life history of a Myxosporidian, Leptotheca ohlmacheri,parasitic in Rana clamitans and Rana pipiens. Parasitology, 14, no. 2.
article_with_references	References: Trypanosomes	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	BRUMPT, E.1906 Rôle pathoghne et mode de transmission du Trypanosoma inopinatum Ed. et Et.Sergent. Mode d’inoculation d’autres trypanosomes. C. R. soc. biol.,61:167-169.
article_with_references	References: Trypanosomes	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	LAVERAN, A. and E. MESNIL (translated and revised by Nabarro).1907 Trypanosomes and trypanosomiases. Chicago. 538 pp., 1 pl. and 8 textfig.

This example shows several things: file_name is identical among rows, since it identifies the article and all references came from one article. The sample file doesn’t follow a typical convention (it was published in 1922), therefore there are several different headings (ref_title). Usually, this is only “Bibliography” or “References”.
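
To see which headings occur in a file and how many entries fall under each, the rows can be counted by ref_title. This is a small sketch with dplyr::count(); heading_counts is an illustrative name.

```r
library(jstor)
library(dplyr)

references <- jst_get_references(jst_example("article_with_references.xml"))

# One row per heading, with the number of entries listed under it.
heading_counts <- references %>%
  count(ref_title, sort = TRUE)

heading_counts
```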

Since the references were not parsed by JSTOR, we only get an unparsed version. In general, the content of references (ref_unparsed) is in quite a raw state, often the result of digitising scans via OCR. For example, the last entry reads like this: MACHADO, A.1911 Zytologische Untersuchungen fiber Trypanosoma rotatorium .... There is an error here: fiber should be über. The language of the source is German, but the OCR software assumed English. Therefore, it didn’t recognize the umlaut. Similar errors are common for text read via OCR.
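
Entries affected by a known OCR confusion can be flagged for manual review with a simple pattern match. The sketch below looks for the word fiber mentioned above; the pattern and the name suspect_refs are assumptions for illustration, and a match is only a hint, not proof of an error.

```r
library(jstor)
library(dplyr)
library(stringr)

references <- jst_get_references(jst_example("article_with_references.xml"))

# Flag entries containing "fiber" as a whole word, which in German-language
# sources is a likely misreading of "über".
suspect_refs <- references %>%
  filter(str_detect(ref_unparsed, "\\bfiber\\b")) %>%
  select(file_name, ref_unparsed)

suspect_refs
```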

For other files, we can set parse_refs = TRUE, so references will be imported in their parsed form, whenever they are available.

jst_get_references(
  jst_example("parsed_references.xml"),
  parse_refs = TRUE
) %>% 
  kable()

file_name	ref_title	ref_authors	ref_editors	ref_collab	ref_item_title	ref_year	ref_source	ref_volume	ref_first_page	ref_last_page	ref_publisher	ref_publication_type	ref_unparsed
parsed_references	Notes	NA	NA	NA	NA	2005	NA	NA	NA	NA	NA	other	1. The USA PATRIOT Act expanded the government’s surveillance power in numerous other ways (see, e.g. Keenan 2005 ).
parsed_references	References	Acohido, B.; Eisler, P.	NA	NA	“Snowden Case: How Low-Level Insider Could Steal from NSA”	2013	USA Today	NA	NA	NA	NA	other	Acohido, B. and Eisler, P. ( 2013 ) “Snowden Case: How Low-Level Insider Could Steal from NSA” , USA Today , 12 June. Available online at http://www.google.com (accessed 15 June 2013).
parsed_references	References	NA	NA	Amnesty International	NA	2013	“USA: Revelations about Government Surveillance ‘raise red flags’”	NA	NA	NA	NA	other	Amnesty International ( 2013 ) “USA: Revelations about Government Surveillance ‘raise red flags’” , 7 June. Available online at http://www.google.com (accessed 14 June 2013).
parsed_references	References	Jacobson, D.	D. E. Davis; J. Go	NA	Chapter title	2009	Book title	NA	281	286	Routledge	book	Jacobson, D. , 2009 . Chapter title . In: D. E. Davis & J. Go , eds. Book title .: Routledge , pp. 281 - 286 .
parsed_references	References	Costall, Alan	NA	NA	“Some article title”	1980	Theory and Psychology	1	123	145	NA	journal	Costall, Alan ( 1980 ). “Some article title” Theory and Psychology 1 : 123 – 145 .
parsed_references	References	Hudson, W.	NA	NA	Another article title	2000	Australian Journal of Cats & Dogs	40	134	150	NA	journal	Hudson, W. , 2000 . Another article title . Australian Journal of Cats & Dogs , September , 40 ( 3 ), p. 134 – 150 .
parsed_references	References	Fries-Britt, S.; Griffin, K.A.	NA	NA	Some article about race	2000	Journal of College Student Fun	20	60	120	NA	journal	Fries-Britt S. , & Griffin K.A. ( 2000 ). Some article about race . Journal of College Student Fun , 20 , 60 – 120 .

Note that other content, such as endnotes, might be present if the article used endnotes rather than footnotes.
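
With parsed references, the extra columns can be used to narrow down the entries, for example to journal articles only. The following sketch relies on the ref_publication_type values shown in the table above; parsed and journal_refs are illustrative names.

```r
library(jstor)
library(dplyr)

parsed <- jst_get_references(
  jst_example("parsed_references.xml"),
  parse_refs = TRUE
)

# Keep only entries that JSTOR classified as journal articles.
journal_refs <- parsed %>%
  filter(ref_publication_type == "journal") %>%
  select(ref_authors, ref_year, ref_item_title, ref_source)

journal_refs
```

Filtering on ref_publication_type also drops rows such as the “Notes” entry, which is classified as other.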

jst_get_footnotes

jst_get_footnotes(jst_example("article_with_references.xml")) %>% 
  kable()

file_name	footnotes
article_with_references	NA

Very commonly, articles either have footnotes or references. The sample file used here does not have footnotes, therefore a simple tibble with missing footnotes is returned.

I will use another file to demonstrate footnotes.

footnotes <- jst_get_footnotes(jst_example("article_with_footnotes.xml"))

footnotes %>% 
  mutate(footnotes = stringr::str_remove_all(footnotes, "\\n")) %>% 
  kable()

file_name	footnotes
article_with_footnotes	[Footnotes]
article_with_footnotes	9Quarterly, vol. XIII, no. 1,entries for April 19 and 21.
article_with_footnotes	10Quarterly, vol. XIII,no. 1, p. 8.
article_with_footnotes	14Quarterly, vol. VIII, no. 1.Olympia Columbian, Sept. 11, 1852,
article_with_footnotes	26Quarterly, vol. XII,no. 2, p. 141.
article_with_footnotes	32Dr. David S. Maynard, later (March 31, 1852)
article_with_footnotes	34Thomas Linklater, Shepherd, since October 6, 1849,

In general, you might need to combine jst_get_footnotes() with jst_get_references() to get all available information on citation data.
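
One way to combine both sources is to reduce each to a common set of columns and stack them. This is a minimal sketch under my own assumptions: the columns source and text and the object citations are made up for illustration, and only the unparsed reference text is kept.

```r
library(jstor)
library(dplyr)

path <- jst_example("article_with_references.xml")

citations <- bind_rows(
  # References: keep the raw entry text.
  jst_get_references(path) %>%
    transmute(file_name, source = "references",
              text = as.character(ref_unparsed)),
  # Footnotes: same shape, different origin.
  jst_get_footnotes(path) %>%
    transmute(file_name, source = "footnotes",
              text = as.character(footnotes))
) %>%
  # Drop placeholder rows for files without references or footnotes.
  filter(!is.na(text))

citations %>%
  count(source)
```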

jst_get_full_text

The function to extract full texts can’t be demonstrated with proper data, since the full texts are only supplied upon special request with DfR. The function guesses the encoding of the specified file via readr::guess_encoding(), reads the whole file and returns a tibble with file_name, full_text and encoding.

I created a file that looks similar to files supplied by DfR with sample text:

full_text <- jst_get_full_text(jst_example("full_text.txt"))
full_text %>% 
  mutate(full_text = stringr::str_remove_all(full_text, "\\n")) %>% 
  kable()

file_name	full_text	encoding
full_text	Lorem ipsum dolor sit amet, consectetur adipisici elit, sed eiusmod tempor incidunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laborisnisi ut aliquid ex ea commodi consequat. Quis aute iure reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sintobcaecat cupiditat non proident, sunt in culpa qui officia deserunt mollit animid est laborum.	ASCII
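
The encoding guess mentioned above can also be tried in isolation. The sketch below writes a short text to a temporary file and passes it to readr::guess_encoding(); the temporary file is a stand-in for a real full-text file from DfR.

```r
library(readr)

# Write a small plain-text file to a temporary location.
tmp <- tempfile(fileext = ".txt")
writeLines("Lorem ipsum dolor sit amet", tmp)

# Returns a tibble with candidate encodings and a confidence score.
enc <- guess_encoding(tmp)
enc
```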

Combining results

Different parts of meta-data can be combined by using dplyr::left_join().

Matching with authors

meta_data %>% 
  left_join(authors) %>%
  select(file_name, article_title, pub_year, given_name, surname) %>% 
  kable()
#> Joining, by = "file_name"

file_name	article_title	pub_year	given_name	surname
article_with_references	On the Protozoa Parasitic in Frogs	1922	R.	Kudo

Matching with references

meta_data %>% 
  left_join(references) %>% 
  select(file_name, article_title, volume, pub_year, ref_unparsed) %>%
  head(5) %>% 
  kable()
#> Joining, by = "file_name"

file_name	article_title	volume	pub_year	ref_unparsed
article_with_references	On the Protozoa Parasitic in Frogs	41	1922	DOBELL, C.C.1909 Researches on the intestinal Protozoa of frogs and toads. Quart. Jour. Micros.Sc., 53:201-276, 4 pl. and 1 textfig.
article_with_references	On the Protozoa Parasitic in Frogs	41	1922	1918 Are Entamoeba histolytica and Entamoeba ranarum the same species? An experi-mental study. Parasit., 10:294-310.
article_with_references	On the Protozoa Parasitic in Frogs	41	1922	KUDO, R.1920 Studies on Myxosporidia. A Synopsis of Genera and Species of Myxosporidia.ill. Biol. Monogr., 5:243-503, 25 pl. and 2 textfig.
article_with_references	On the Protozoa Parasitic in Frogs	41	1922	1921 On the nature of structures characteristic of Cnidosporidian spores. Trans.Micro. Soc., 40:60-74.
article_with_references	On the Protozoa Parasitic in Frogs	41	1922	1922 On the morphology and life history of a Myxosporidian, Leptotheca ohlmacheri,parasitic in Rana clamitans and Rana pipiens. Parasitology, 14, no. 2.
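
The join column can also be stated explicitly, which avoids the “Joining, by” message and makes the intent clearer. A self-contained sketch follows; joined is an illustrative name.

```r
library(jstor)
library(dplyr)

path <- jst_example("article_with_references.xml")
meta_data <- jst_get_article(path)
references <- jst_get_references(path)

# Explicit key: each reference row is matched to its article via file_name.
joined <- meta_data %>%
  left_join(references, by = "file_name")

nrow(joined)
```

Since the article metadata has a single row per file, the joined result has as many rows as there are reference entries for that file.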

Books

Quite recently DfR added book chapters to their stack. To import metadata about the books and chapters, jstor supplies jst_get_book and jst_get_chapters.

jst_get_book is very similar to jst_get_article. We obtain general information about the complete book:

jst_get_book(jst_example("book.xml")) %>% knitr::kable()

book_id	file_name	discipline	book_title	book_subtitle	pub_day	pub_month	pub_year	isbn	publisher_name	publisher_location	n_pages	language
j.ctt24hdz7	book	Political Science	The 2006 Military Takeover in Fiji	A Coup to End All Coups?	30	4	2009	9781921536502; 9781921536519	ANU E Press	Canberra	NA	eng

A single book might contain many chapters. jst_get_chapters extracts all of them. Due to this, the function is a bit slower than most of jstor’s other functions.

chapters <- jst_get_chapters(jst_example("book.xml"))

str(chapters)
#> tibble [36 × 9] (S3: tbl_df/tbl/data.frame)
#>  $ book_id        : chr [1:36] "j.ctt24hdz7" "j.ctt24hdz7" "j.ctt24hdz7" "j.ctt24hdz7" ...
#>  $ file_name      : chr [1:36] "book" "book" "book" "book" ...
#>  $ part_id        : chr [1:36] "j.ctt24hdz7.1" "j.ctt24hdz7.2" "j.ctt24hdz7.3" "j.ctt24hdz7.4" ...
#>  $ part_label     : chr [1:36] NA NA NA NA ...
#>  $ part_title     : chr [1:36] "Front Matter" "Table of Contents" "Acronyms and abbreviations" "Authors’ biographies" ...
#>  $ part_subtitle  : chr [1:36] NA NA NA NA ...
#>  $ authors        : chr [1:36] NA NA NA NA ...
#>  $ abstract       : chr [1:36] NA NA NA NA ...
#>  $ part_first_page: chr [1:36] "i" "v" "vii" "xi" ...

Without the abstracts (they are rather long) the first 10 chapters look like this:

chapters %>% 
  select(-abstract) %>% 
  head(10) %>% 
  kable()

book_id	file_name	part_id	part_label	part_title	part_subtitle	authors	part_first_page
j.ctt24hdz7	book	j.ctt24hdz7.1	NA	Front Matter	NA	NA	i
j.ctt24hdz7	book	j.ctt24hdz7.2	NA	Table of Contents	NA	NA	v
j.ctt24hdz7	book	j.ctt24hdz7.3	NA	Acronyms and abbreviations	NA	NA	vii
j.ctt24hdz7	book	j.ctt24hdz7.4	NA	Authors’ biographies	NA	NA	xi
j.ctt24hdz7	book	j.ctt24hdz7.5	1.	The enigmas of Fiji’s good governance coup	NA	NA	3
j.ctt24hdz7	book	j.ctt24hdz7.6	2.	‘Anxiety, uncertainty and fear in our land’:	Fiji’s road to military coup, 2006	NA	21
j.ctt24hdz7	book	j.ctt24hdz7.7	3.	Fiji’s December 2006 coup:	Who, what, where and why?	NA	43
j.ctt24hdz7	book	j.ctt24hdz7.8	4.	‘This process of political readjustment’:	The aftermath of the 2006 Fiji Coup	NA	67
j.ctt24hdz7	book	j.ctt24hdz7.9	5.	The changing role of the Great Council of Chiefs	NA	NA	97
j.ctt24hdz7	book	j.ctt24hdz7.10	6.	The Fiji military and ethno-nationalism:	Analyzing the paradox	NA	117
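
Parts such as “Front Matter” or “Table of Contents” have no part_label in this example. Assuming this holds for the book at hand, the numbered chapters can be separated from the rest with a simple filter. This is a sketch and not a general rule for all books; numbered is an illustrative name.

```r
library(jstor)
library(dplyr)

chapters <- jst_get_chapters(jst_example("book.xml"))

# Keep only parts that carry a label such as "1." or "2.".
numbered <- chapters %>%
  filter(!is.na(part_label)) %>%
  select(part_id, part_label, part_title, part_first_page)

head(numbered)
```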

Since extracting all authors for all chapters needs considerably more time, by default authors are not extracted. You can import them like so:

author_chap <- jst_get_chapters(jst_example("book.xml"), authors = TRUE)

The authors are supplied in a list column:

class(author_chap$authors)
#> [1] "list"

You can expand this list with tidyr::unnest:

author_chap %>% 
  tidyr::unnest(authors) %>% 
  select(part_id, given_name, surname) %>% 
  head(10) %>% 
  kable()

part_id	given_name	surname
j.ctt24hdz7.1	NA	NA
j.ctt24hdz7.2	NA	NA
j.ctt24hdz7.3	NA	NA
j.ctt24hdz7.4	NA	NA
j.ctt24hdz7.5	Jon	Fraenkel
j.ctt24hdz7.5	Stewart	Firth
j.ctt24hdz7.6	Brij V.	Lal
j.ctt24hdz7.7	Jon	Fraenkel
j.ctt24hdz7.8	Brij V.	Lal
j.ctt24hdz7.9	Robert	Norton
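
Once the list column is expanded, the usual dplyr verbs apply, for example to count the authors per chapter. This is a small sketch; authors_per_chapter and n_authors are illustrative names, and authors that only have a string_name would be dropped by the filter used here.

```r
library(jstor)
library(dplyr)

author_chap <- jst_get_chapters(jst_example("book.xml"), authors = TRUE)

authors_per_chapter <- author_chap %>%
  tidyr::unnest(authors) %>%
  # Parts without authors produce a row of NAs; remove those.
  filter(!is.na(surname)) %>%
  count(part_id, name = "n_authors")

head(authors_per_chapter)
```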

You can learn more about the concept of list-columns in Hadley Wickham’s book R for Data Science.

Introduction to jstor

Thomas Klebel

2023-08-15