
Downloads and returns PNS microdata for specified years from the IBGE FTP. Data is cached locally to avoid repeated downloads. When the arrow package is installed, data is cached in parquet format for faster subsequent reads.

Usage

pns_data(
  year = NULL,
  vars = NULL,
  cache_dir = NULL,
  refresh = FALSE,
  lazy = FALSE,
  backend = c("arrow", "duckdb")
)

Arguments

year

Numeric scalar or vector. Year(s) to download. Available years are 2013 and 2019. Use NULL to download all available years. Default is NULL.

vars

Character vector. Variables to select. Use NULL for all variables. Default is NULL.

cache_dir

Character. Directory for caching downloaded files. Defaults to tools::R_user_dir("healthbR", "cache").

refresh

Logical. If TRUE, re-download even if file exists in cache. Default is FALSE.

lazy

Logical. If TRUE, returns a lazy query object instead of a tibble. Requires the arrow package. The lazy object supports dplyr verbs (filter, select, mutate, etc.), which are pushed down to the query engine before the data is collected into memory. Call dplyr::collect() to materialize the result. Default is FALSE.

backend

Character. Backend for lazy evaluation: "arrow" (default) or "duckdb". Only used when lazy = TRUE. DuckDB backend requires the duckdb package.
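The lazy workflow described above can be sketched as follows. This is a minimal illustration, not part of the package: select_adults() is a hypothetical helper, and it assumes C008 is stored as a numeric column (the variable names are taken from the Examples section). Because it uses only dplyr verbs, the same helper works on a lazy query object or on an ordinary data frame.

```r
library(dplyr)

# Hypothetical helper (not part of healthbR): keep rows where C008 is at
# least `min_age`, keep four columns, then materialize the result.
# Assumes C008 is numeric.
select_adults <- function(tbl, min_age = 18) {
  tbl |>
    filter(C008 >= min_age) |>
    select(V0001, C006, C008, V0028) |>
    collect()
}

# Only runs in an interactive session with healthbR installed,
# because it downloads data from the IBGE FTP server.
if (interactive() && requireNamespace("healthbR", quietly = TRUE)) {
  pns_lazy <- healthbR::pns_data(
    year = 2019,
    lazy = TRUE,
    backend = "arrow",
    cache_dir = tempdir()
  )
  adults <- select_adults(pns_lazy)
}
```

With lazy = TRUE, the filter and select steps are applied by the query engine, so only the matching rows and columns are loaded into memory when collect() is called.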

Value

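A tibble with PNS microdata. When lazy = TRUE, a lazy query object is returned instead; call dplyr::collect() to obtain a tibble.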

Details

The PNS (Pesquisa Nacional de Saúde, the National Health Survey) is a household survey conducted by IBGE in partnership with the Ministry of Health. It provides comprehensive data on the health conditions, lifestyle, and healthcare access of the Brazilian population.

Survey design variables

For proper statistical analysis that accounts for the complex survey design, use the following weight and design variables with the srvyr or survey packages:

  • V0028: household weight

  • V0029: selected person weight

  • V0030: person weight with non-response adjustment

  • UPA_PNS: primary sampling unit

  • V0024: stratum
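As a rough sketch of how these variables fit together, the example below declares a design with the survey package on a small made-up data frame that stands in for the output of pns_data(). The values are invented for illustration, C008 is treated as a numeric variable, and V0028 is used only as an example weight; the appropriate weight depends on the unit of analysis.

```r
library(survey)

# Toy stand-in for pns_data() output; all values are made up.
toy <- data.frame(
  V0024   = c(1, 1, 1, 1, 2, 2, 2, 2),              # stratum
  UPA_PNS = c(11, 11, 12, 12, 21, 21, 22, 22),      # primary sampling unit
  V0028   = c(100, 120, 90, 110, 80, 95, 105, 70),  # household weight
  C008    = c(34, 51, 27, 63, 45, 19, 72, 38)       # numeric variable to summarize
)

# Declare the design: PSUs nested within strata, weighted by V0028.
design <- svydesign(
  ids = ~UPA_PNS,
  strata = ~V0024,
  weights = ~V0028,
  data = toy,
  nest = TRUE
)

# Weighted mean of C008 with a design-based standard error.
est <- svymean(~C008, design)
print(est)
```

The point estimate matches a plain weighted mean, but the standard error reflects the stratification and clustering, which an unweighted or weights-only analysis would ignore.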

Data source

Data is downloaded from the IBGE FTP server: https://ftp.ibge.gov.br/PNS/

Examples

if (FALSE) { # interactive()
# download PNS 2019 data
df <- pns_data(year = 2019, cache_dir = tempdir())

# download all years
df_all <- pns_data(cache_dir = tempdir())

# select specific variables
df_subset <- pns_data(
  year = 2019,
  vars = c("V0001", "C006", "C008", "V0028"),
  cache_dir = tempdir()
)
}