| Title: | Approximate k-Nearest Neighbour Search for 'bigmemory' Matrices with Annoy |
|---|---|
| Description: | Approximate Euclidean k-nearest neighbour search routines that operate on 'bigmemory::big.matrix' data through Annoy indexes created with 'RcppAnnoy'. The package builds persistent on-disk indexes plus sidecar metadata from streamed 'big.matrix' rows, supports Euclidean, angular, Manhattan, and dot-product Annoy metrics, and can either return in-memory results or stream neighbour indices and distances into destination 'bigmemory' matrices. Explicit index lifecycle helpers, stronger metadata validation, descriptor-aware file-backed workflows, and benchmark helpers are also included. |
| Authors: | Frederic Bertrand [aut, cre] |
| Maintainer: | Frederic Bertrand <[email protected]> |
| License: | GPL (>= 2) |
| Version: | 0.3.0 |
| Built: | 2026-06-01 11:07:35 UTC |
| Source: | https://github.com/fbertran/bigannoy |
Stream the rows of a reference bigmemory::big.matrix into an on-disk
Annoy index and write a small sidecar metadata file next to it. The returned
bigannoy_index can be reopened later with annoy_open_index().
annoy_build_bigmatrix( x, path, n_trees = 50L, metric = "euclidean", seed = NULL, build_threads = -1L, block_size = annoy_default_block_size(), metadata_path = NULL, load_mode = "lazy" )
x |
A |
path |
File path where the Annoy index should be written. |
n_trees |
Number of Annoy trees to build. |
metric |
Distance metric. bigANNOY v2 supports |
seed |
Optional positive integer seed used to initialize Annoy's build RNG. |
build_threads |
Build-thread setting passed to Annoy's native backend.
Use |
block_size |
Number of rows processed per streamed block while building the index. |
metadata_path |
Optional path for the sidecar metadata file. Defaults to
|
load_mode |
Whether to keep the returned index metadata-only until first
search ( |
A bigannoy_index object describing the persisted Annoy index.
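A minimal build sketch follows. It assumes the package is installed under the name bigANNOY (inferred from the "bigANNOY.backend" option, not stated in the metadata table) and that an in-memory big.matrix created with bigmemory::as.big.matrix() is an acceptable x. The matrix sizes, tree count, and seed are illustrative only.

```r
# Illustrative data: 1000 reference rows in 10 dimensions.
set.seed(1)
ref_dense  <- matrix(rnorm(1000 * 10), nrow = 1000, ncol = 10)
index_path <- tempfile(fileext = ".ann")
index      <- NULL

# Guarded so the script still runs when the packages are not installed.
# Assumption: the package name is 'bigANNOY'.
if (requireNamespace("bigANNOY", quietly = TRUE) &&
    requireNamespace("bigmemory", quietly = TRUE)) {
  ref <- bigmemory::as.big.matrix(ref_dense)
  index <- bigANNOY::annoy_build_bigmatrix(
    ref,
    path    = index_path,   # Annoy index file; sidecar metadata goes next to it
    n_trees = 20L,
    metric  = "euclidean",
    seed    = 123L          # fixed seed so repeated builds are comparable
  )
  print(index)
}
```

With the default load_mode = "lazy", the returned object is expected to stay metadata-only until the first search.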
Close any loaded Annoy handle cached inside a bigannoy_index
annoy_close_index(index)
index |
A |
index, invisibly.
Check whether an index currently has a loaded in-memory handle
annoy_is_loaded(index)
index |
A |
TRUE when a live native or debug-only handle is cached, otherwise
FALSE.
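The lifecycle helpers can be exercised together as sketched below. This assumes the package name bigANNOY and an in-memory big.matrix input. The logical values are recorded rather than asserted, because the exact moment a handle is loaded depends on the backend.

```r
set.seed(2)
ref_dense     <- matrix(rnorm(500 * 6), nrow = 500, ncol = 6)
index_path    <- tempfile(fileext = ".ann")
loaded_states <- NULL

# Assumption: the package name is 'bigANNOY'.
if (requireNamespace("bigANNOY", quietly = TRUE) &&
    requireNamespace("bigmemory", quietly = TRUE)) {
  ref <- bigmemory::as.big.matrix(ref_dense)
  index <- bigANNOY::annoy_build_bigmatrix(
    ref, path = index_path, n_trees = 10L, seed = 1L, load_mode = "lazy"
  )

  before <- bigANNOY::annoy_is_loaded(index)         # state right after a lazy build
  invisible(bigANNOY::annoy_search_bigmatrix(index, k = 3L))
  after_search <- bigANNOY::annoy_is_loaded(index)   # state after the first search

  # annoy_close_index() returns the index invisibly, so reassigning is safe.
  index <- bigANNOY::annoy_close_index(index)
  after_close <- bigANNOY::annoy_is_loaded(index)    # state after closing

  loaded_states <- c(before = before,
                     after_search = after_search,
                     after_close = after_close)
  print(loaded_states)
}
```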
Load an existing Annoy index for bigmatrix workflows
annoy_load_bigmatrix( path, metadata_path = NULL, prefault = FALSE, load_mode = "eager" )
path |
File path to an existing Annoy index built by
|
metadata_path |
Optional path to the sidecar metadata file. |
prefault |
Logical flag indicating whether searches should prefault the index when loaded by the native backend. |
load_mode |
Whether to eagerly load the native index handle on open or defer until first search. |
A bigannoy_index object that can be passed to
annoy_search_bigmatrix().
Open an existing Annoy index and its sidecar metadata
annoy_open_index( path, metadata_path = NULL, prefault = FALSE, load_mode = "eager" )
path |
File path to an existing Annoy index built by
|
metadata_path |
Optional path to the sidecar metadata file. |
prefault |
Logical flag indicating whether searches should prefault the index when loaded by the native backend. |
load_mode |
Whether to eagerly load the native index handle on open or defer until first search. |
A bigannoy_index object that can be passed to
annoy_search_bigmatrix().
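Reopening a persisted index might look like the sketch below. It assumes the package name bigANNOY and relies on the default metadata_path = NULL locating the sidecar file written at build time. In a real workflow the build and the reopen would typically happen in different R sessions.

```r
set.seed(3)
ref_dense  <- matrix(rnorm(400 * 5), nrow = 400, ncol = 5)
index_path <- tempfile(fileext = ".ann")
reopened   <- NULL

# Assumption: the package name is 'bigANNOY'.
if (requireNamespace("bigANNOY", quietly = TRUE) &&
    requireNamespace("bigmemory", quietly = TRUE)) {
  ref <- bigmemory::as.big.matrix(ref_dense)
  bigANNOY::annoy_build_bigmatrix(ref, path = index_path, n_trees = 10L, seed = 1L)

  # Reopen from the file path alone; defer loading until the first search.
  reopened <- bigANNOY::annoy_open_index(index_path, load_mode = "lazy")
  res <- bigANNOY::annoy_search_bigmatrix(reopened, k = 3L)
  print(dim(res$index))
}
```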
Query a persisted Annoy index created by annoy_build_bigmatrix() or
reopened with annoy_open_index(). Supply query = NULL for self-search
over the indexed reference rows, or provide a dense numeric matrix,
big.matrix, or external pointer for external-query search. Results can be
returned in memory or streamed into destination big.matrix objects.
annoy_search_bigmatrix( index, query = NULL, k = 10L, search_k = -1L, xpIndex = NULL, xpDistance = NULL, prefault = NULL, block_size = annoy_default_block_size() )
index |
A |
query |
Optional query source. Supply |
k |
Number of neighbours to return. |
search_k |
Annoy's runtime search budget. Use |
xpIndex |
Optional writable |
xpDistance |
Optional writable |
prefault |
Optional logical override controlling whether the native backend prefaults the Annoy file while loading it for search. |
block_size |
Number of queries processed per block. |
A list with components index, distance, k, metric, n_ref,
n_query, exact, and backend.
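The two in-memory query modes can be sketched as follows, assuming the package name bigANNOY. Streaming into destination big.matrix objects through xpIndex and xpDistance is not shown, because the required storage types are not spelled out above.

```r
set.seed(4)
ref_dense  <- matrix(rnorm(600 * 8), nrow = 600, ncol = 8)
qry_dense  <- matrix(rnorm(20 * 8),  nrow = 20,  ncol = 8)
index_path <- tempfile(fileext = ".ann")
res_self   <- NULL
res_ext    <- NULL

# Assumption: the package name is 'bigANNOY'.
if (requireNamespace("bigANNOY", quietly = TRUE) &&
    requireNamespace("bigmemory", quietly = TRUE)) {
  ref <- bigmemory::as.big.matrix(ref_dense)
  index <- bigANNOY::annoy_build_bigmatrix(
    ref, path = index_path, n_trees = 20L, seed = 1L
  )

  # Self-search: query = NULL searches the indexed reference rows themselves.
  res_self <- bigANNOY::annoy_search_bigmatrix(index, query = NULL, k = 5L)

  # External-query search with a dense numeric matrix.
  res_ext <- bigANNOY::annoy_search_bigmatrix(index, query = qry_dense, k = 5L)

  # Components named in the return value described above.
  print(names(res_ext))
  print(dim(res_ext$index))
  print(dim(res_ext$distance))
}
```

Raising search_k above its default generally trades search time for accuracy in Annoy.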
Validate a persisted Annoy index and its sidecar metadata
annoy_validate_index(index, strict = TRUE, load = TRUE, prefault = NULL)
index |
A |
strict |
Whether failed validation checks should raise an error. |
load |
Whether to also verify that the index can be loaded successfully. |
prefault |
Optional logical override used when |
A list containing valid, checks, and the normalized index.
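A non-strict validation pass, which collects a report instead of raising an error, might be used as sketched here. The package name bigANNOY is an assumption, and the structure of the checks component is not documented above, so it is only printed.

```r
set.seed(5)
ref_dense  <- matrix(rnorm(300 * 4), nrow = 300, ncol = 4)
index_path <- tempfile(fileext = ".ann")
report     <- NULL

# Assumption: the package name is 'bigANNOY'.
if (requireNamespace("bigANNOY", quietly = TRUE) &&
    requireNamespace("bigmemory", quietly = TRUE)) {
  ref <- bigmemory::as.big.matrix(ref_dense)
  index <- bigANNOY::annoy_build_bigmatrix(
    ref, path = index_path, n_trees = 10L, seed = 1L
  )

  # strict = FALSE: failed checks are reported rather than raised as errors.
  # load = TRUE: also confirm that the index file can actually be loaded.
  report <- bigANNOY::annoy_validate_index(index, strict = FALSE, load = TRUE)
  print(report$valid)
  print(report$checks)
}
```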
Build or reuse a benchmark reference dataset, create an Annoy index, query
it, and optionally compare recall against the exact bigKNN Euclidean
baseline.
benchmark_annoy_bigmatrix( x = NULL, query = NULL, n_ref = 2000L, n_query = 200L, n_dim = 20L, k = 10L, n_trees = 50L, metric = "euclidean", search_k = -1L, seed = 42L, build_seed = seed, build_threads = -1L, block_size = annoy_default_block_size(), backend = getOption("bigANNOY.backend", "cpp"), exact = TRUE, filebacked = FALSE, path_dir = tempdir(), keep_files = FALSE, output_path = NULL, load_mode = "eager" )
x |
Optional benchmark reference input. Supply |
query |
Optional benchmark query input. Supply |
n_ref |
Number of synthetic reference rows to generate when |
n_query |
Number of synthetic query rows to generate when |
n_dim |
Number of synthetic columns to generate when |
k |
Number of neighbours to return. |
n_trees |
Number of Annoy trees to build. |
metric |
Annoy metric. One of |
search_k |
Annoy search budget. |
seed |
Random seed used for synthetic data generation and, by default, for the Annoy build seed. |
build_seed |
Optional Annoy build seed. Defaults to |
build_threads |
Native Annoy build-thread setting. |
block_size |
Build/search block size. |
backend |
Requested bigANNOY backend. |
exact |
Logical flag controlling whether to benchmark the exact
Euclidean baseline with |
filebacked |
Logical flag; if |
path_dir |
Directory where temporary Annoy and optional file-backed benchmark files should be written. |
keep_files |
Logical flag; if |
output_path |
Optional CSV path where the benchmark summary should be written. |
load_mode |
Whether the benchmarked index should be returned
metadata-only until first search ( |
A list with a one-row summary data frame plus the benchmark
parameters and generated Annoy file paths.
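A small synthetic benchmark run could look like the sketch below. It assumes the package name bigANNOY and sets exact = FALSE so that no exact baseline is computed. The names of the returned list components are not given above, so the result is inspected with str() rather than by name.

```r
bench <- NULL

# Assumption: the package name is 'bigANNOY'.
if (requireNamespace("bigANNOY", quietly = TRUE)) {
  bench <- bigANNOY::benchmark_annoy_bigmatrix(
    n_ref    = 500L,     # synthetic reference rows
    n_query  = 50L,      # synthetic query rows
    n_dim    = 8L,
    k        = 5L,
    n_trees  = 10L,
    seed     = 42L,
    exact    = FALSE,    # skip the exact Euclidean baseline
    path_dir = tempdir()
  )
  str(bench, max.level = 1)
}
```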
Run a grid of n_trees and search_k settings on the same benchmark
dataset, optionally recording recall against the exact bigKNN Euclidean
baseline.
benchmark_annoy_recall_suite( x = NULL, query = NULL, n_ref = 2000L, n_query = 200L, n_dim = 20L, k = 10L, n_trees = c(10L, 50L, 100L), search_k = c(-1L, 1000L, 5000L), metric = "euclidean", seed = 42L, build_seed = seed, build_threads = -1L, block_size = annoy_default_block_size(), backend = getOption("bigANNOY.backend", "cpp"), exact = TRUE, filebacked = FALSE, path_dir = tempdir(), keep_files = FALSE, output_path = NULL, load_mode = "eager" )
x |
Optional benchmark reference input. Supply |
query |
Optional benchmark query input. Supply |
n_ref |
Number of synthetic reference rows to generate when |
n_query |
Number of synthetic query rows to generate when |
n_dim |
Number of synthetic columns to generate when |
k |
Number of neighbours to return. |
n_trees |
Integer vector of Annoy tree counts to benchmark. |
search_k |
Integer vector of Annoy search budgets to benchmark. |
metric |
Annoy metric. One of |
seed |
Random seed used for synthetic data generation and, by default, for the Annoy build seed. |
build_seed |
Optional Annoy build seed. Defaults to |
build_threads |
Native Annoy build-thread setting. |
block_size |
Build/search block size. |
backend |
Requested bigANNOY backend. |
exact |
Logical flag controlling whether to benchmark the exact
Euclidean baseline with |
filebacked |
Logical flag; if |
path_dir |
Directory where temporary Annoy and optional file-backed benchmark files should be written. |
keep_files |
Logical flag; if |
output_path |
Optional CSV path where the benchmark summary should be written. |
load_mode |
Whether the benchmarked index should be returned
metadata-only until first search ( |
A list with a summary data frame containing one row per
(n_trees, search_k) configuration.
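For orientation, recall against an exact baseline is commonly computed as the mean fraction of exact neighbours that the approximate search recovers per query. The helper below, recall_at_k(), is a hypothetical base-R illustration of that usual definition; the package's own recall computation is not documented above and may differ in detail.

```r
# Hypothetical helper (not part of the package): mean per-query fraction of
# exact neighbours that also appear in the approximate result.
recall_at_k <- function(approx_index, exact_index) {
  stopifnot(identical(dim(approx_index), dim(exact_index)))
  hits <- vapply(
    seq_len(nrow(exact_index)),
    function(i) length(intersect(approx_index[i, ], exact_index[i, ])),
    integer(1)
  )
  mean(hits / ncol(exact_index))
}

# Toy neighbour matrices: 2 queries, k = 3.
exact_index  <- rbind(c(2L, 3L, 4L), c(1L, 3L, 5L))
approx_index <- rbind(c(2L, 4L, 9L), c(1L, 3L, 5L))

# Query 1 recovers 2 of 3 neighbours, query 2 recovers 3 of 3.
recall_at_k(approx_index, exact_index)  # 5/6
```

Neighbour order within a row is ignored here; only set membership counts.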
Run benchmark_annoy_vs_rcppannoy() over a grid of synthetic data sizes to
study how build time, search time, and index size scale with data volume.
benchmark_annoy_volume_suite( n_ref = c(2000L, 5000L, 10000L), n_query = 200L, n_dim = c(20L, 50L), k = 10L, n_trees = 50L, metric = "euclidean", search_k = -1L, seed = 42L, build_seed = seed, build_threads = -1L, block_size = annoy_default_block_size(), backend = getOption("bigANNOY.backend", "cpp"), exact = FALSE, filebacked = FALSE, path_dir = tempdir(), keep_files = FALSE, output_path = NULL, load_mode = "eager" )
n_ref |
Integer vector of synthetic reference row counts. |
n_query |
Integer vector of synthetic query row counts. |
n_dim |
Integer vector of synthetic column counts. |
k |
Number of neighbours to return. |
n_trees |
Number of Annoy trees to build. |
metric |
Annoy metric. One of |
search_k |
Annoy search budget. |
seed |
Random seed used for synthetic data generation and, by default, for the Annoy build seed. |
build_seed |
Optional Annoy build seed. Defaults to |
build_threads |
Native Annoy build-thread setting. |
block_size |
Build/search block size. |
backend |
Requested bigANNOY backend. |
exact |
Logical flag controlling whether to benchmark the exact
Euclidean baseline with |
filebacked |
Logical flag; if |
path_dir |
Directory where temporary Annoy and optional file-backed benchmark files should be written. |
keep_files |
Logical flag; if |
output_path |
Optional CSV path where the benchmark summary should be written. |
load_mode |
Whether the benchmarked index should be returned
metadata-only until first search ( |
A list with a summary data frame containing one row per
implementation and data-volume combination.
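To get a feel for the grid that the default arguments describe, the base-R sketch below enumerates the n_ref and n_dim combinations and estimates the dense size of each reference matrix. It assumes 8-byte double storage; how the suite itself measures data volume is not specified above.

```r
# Default grid values taken from the usage line above.
n_ref <- c(2000L, 5000L, 10000L)
n_dim <- c(20L, 50L)

# One row per data-volume combination.
grid <- expand.grid(n_ref = n_ref, n_dim = n_dim)

# Assumption: dense double-precision storage, 8 bytes per cell.
grid$ref_bytes <- as.numeric(grid$n_ref) * grid$n_dim * 8
grid
```

The default grid therefore spans six reference sizes, from 320000 bytes (2000 x 20) to 4000000 bytes (10000 x 50) under this assumption.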
Run the same Annoy build and search task through bigANNOY and through a
direct dense RcppAnnoy baseline. The comparison reports both speed metrics
and data-volume metrics such as reference bytes, query bytes, and generated
index size.
benchmark_annoy_vs_rcppannoy( x = NULL, query = NULL, n_ref = 2000L, n_query = 200L, n_dim = 20L, k = 10L, n_trees = 50L, metric = "euclidean", search_k = -1L, seed = 42L, build_seed = seed, build_threads = -1L, block_size = annoy_default_block_size(), backend = getOption("bigANNOY.backend", "cpp"), exact = TRUE, filebacked = FALSE, path_dir = tempdir(), keep_files = FALSE, output_path = NULL, load_mode = "eager" )
x |
Optional benchmark reference input. Supply |
query |
Optional benchmark query input. Supply |
n_ref |
Number of synthetic reference rows to generate when |
n_query |
Number of synthetic query rows to generate when |
n_dim |
Number of synthetic columns to generate when |
k |
Number of neighbours to return. |
n_trees |
Number of Annoy trees to build. |
metric |
Annoy metric. One of |
search_k |
Annoy search budget. |
seed |
Random seed used for synthetic data generation and, by default, for the Annoy build seed. |
build_seed |
Optional Annoy build seed. Defaults to |
build_threads |
Native Annoy build-thread setting. |
block_size |
Build/search block size. |
backend |
Requested bigANNOY backend. |
exact |
Logical flag controlling whether to benchmark the exact
Euclidean baseline with |
filebacked |
Logical flag; if |
path_dir |
Directory where temporary Annoy and optional file-backed benchmark files should be written. |
keep_files |
Logical flag; if |
output_path |
Optional CSV path where the benchmark summary should be written. |
load_mode |
Whether the benchmarked index should be returned
metadata-only until first search ( |
A list with a two-row summary data frame, one row for bigANNOY
and one for direct RcppAnnoy, plus benchmark metadata and any validation
report produced for the bigANNOY index.
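For context, a direct dense RcppAnnoy workflow of the kind used as the baseline can be sketched as below. This is an illustration of plain RcppAnnoy usage, not the benchmark's internal code: the whole reference matrix is held in memory and rows are added one at a time with 0-based item ids.

```r
set.seed(42)
n_dim <- 8L
k     <- 3L
ref   <- matrix(rnorm(200 * n_dim), ncol = n_dim)
qry   <- matrix(rnorm(5 * n_dim),   ncol = n_dim)
nn    <- NULL

if (requireNamespace("RcppAnnoy", quietly = TRUE)) {
  ann <- methods::new(RcppAnnoy::AnnoyEuclidean, n_dim)
  ann$setSeed(42L)                    # must be set before build()
  for (i in seq_len(nrow(ref))) {
    ann$addItem(i - 1L, ref[i, ])     # Annoy item ids are 0-based
  }
  ann$build(20L)                      # number of trees

  # One row of 1-based neighbour ids per query.
  nn <- do.call(rbind, lapply(
    seq_len(nrow(qry)),
    function(i) ann$getNNsByVector(qry[i, ], k) + 1L
  ))
  print(nn)
}
```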
Print a bigannoy_index
## S3 method for class 'bigannoy_index'
print(x, ...)
x |
A |
... |
Unused. |
x, invisibly.