
Bump bolt to v1.1.0

It adds ARM64, ppc64le, s390x, and Solaris support, along with a number
of bug fixes.

Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>

Qiang Huang authored on 2015/10/28 09:49:45
Showing 21 changed files
... ...
@@ -36,7 +36,7 @@ clone git github.com/coreos/etcd v2.2.0
36 36
 fix_rewritten_imports github.com/coreos/etcd
37 37
 clone git github.com/ugorji/go 5abd4e96a45c386928ed2ca2a7ef63e2533e18ec
38 38
 clone git github.com/hashicorp/consul v0.5.2
39
-clone git github.com/boltdb/bolt v1.0
39
+clone git github.com/boltdb/bolt v1.1.0
40 40
 
41 41
 # get graph and distribution packages
42 42
 clone git github.com/docker/distribution 20c4b7a1805a52753dfd593ee1cc35558722a0ce # docker/1.9 branch
... ...
@@ -1,3 +1,4 @@
1 1
 *.prof
2 2
 *.test
3
+*.swp
3 4
 /bin/
... ...
@@ -16,7 +16,7 @@ and setting values. That's it.
16 16
 
17 17
 ## Project Status
18 18
 
19
-Bolt is stable and the API is fixed. Full unit test coverage and randomized 
19
+Bolt is stable and the API is fixed. Full unit test coverage and randomized
20 20
 black box testing are used to ensure database consistency and thread safety.
21 21
 Bolt is currently in high-load production environments serving databases as
22 22
 large as 1TB. Many companies such as Shopify and Heroku use Bolt-backed
... ...
@@ -87,6 +87,11 @@ are not thread safe. To work with data in multiple goroutines you must start
87 87
 a transaction for each one or use locking to ensure only one goroutine accesses
88 88
 a transaction at a time. Creating transaction from the `DB` is thread safe.
89 89
 
90
+Read-only transactions and read-write transactions should not depend on one
91
+another and generally shouldn't be opened simultaneously in the same goroutine.
92
+This can cause a deadlock as the read-write transaction needs to periodically
93
+re-map the data file but it cannot do so while a read-only transaction is open.
94
+
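A minimal sketch of the anti-pattern this paragraph warns about (hypothetical code, assuming an open `db` and an existing bucket):

```go
// Deadlock risk: a read-write transaction opened while a read-only
// transaction is still open in the same goroutine.
db.View(func(tx *bolt.Tx) error {
	// If this update needs to grow and re-map the data file, it waits
	// for all read-only transactions to close -- including the one above.
	return db.Update(func(wtx *bolt.Tx) error {
		return wtx.Bucket([]byte("MyBucket")).Put([]byte("k"), []byte("v"))
	})
})
```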
90 95
 
91 96
 #### Read-write transactions
92 97
 
... ...
@@ -120,12 +125,88 @@ err := db.View(func(tx *bolt.Tx) error {
120 120
 })
121 121
 ```
122 122
 
123
-You also get a consistent view of the database within this closure, however, 
123
+You also get a consistent view of the database within this closure, however,
124 124
 no mutating operations are allowed within a read-only transaction. You can only
125 125
 retrieve buckets, retrieve values, and copy the database within a read-only
126 126
 transaction.
127 127
 
128 128
 
129
+#### Batch read-write transactions
130
+
131
+Each `DB.Update()` waits for disk to commit the writes. This overhead
132
+can be minimized by combining multiple updates with the `DB.Batch()`
133
+function:
134
+
135
+```go
136
+err := db.Batch(func(tx *bolt.Tx) error {
137
+	...
138
+	return nil
139
+})
140
+```
141
+
142
+Concurrent Batch calls are opportunistically combined into larger
143
+transactions. Batch is only useful when there are multiple goroutines
144
+calling it.
145
+
146
+The trade-off is that `Batch` can call the given
147
+function multiple times if parts of the transaction fail. The
148
+function must be idempotent and side effects must take effect only
149
+after a successful return from `DB.Batch()`.
150
+
151
+For example: don't display messages from inside the function; instead
152
+set variables in the enclosing scope:
153
+
154
+```go
155
+var id uint64
156
+err := db.Batch(func(tx *bolt.Tx) error {
157
+	// Find last key in bucket, decode as bigendian uint64, increment
158
+	// by one, encode back to []byte, and add new key.
159
+	...
160
+	id = newValue
161
+	return nil
162
+})
163
+if err != nil {
164
+	return ...
165
+}
166
+fmt.Printf("Allocated ID %d\n", id)
167
+```
168
+
169
+
170
+#### Managing transactions manually
171
+
172
+The `DB.View()` and `DB.Update()` functions are wrappers around the `DB.Begin()`
173
+function. These helper functions will start the transaction, execute a function,
174
+and then safely close your transaction if an error is returned. This is the
175
+recommended way to use Bolt transactions.
176
+
177
+However, sometimes you may want to manually start and end your transactions.
178
+You can use the `DB.Begin()` function directly but _please_ be sure to close the
179
+transaction.
180
+
181
+```go
182
+// Start a writable transaction.
183
+tx, err := db.Begin(true)
184
+if err != nil {
185
+    return err
186
+}
187
+defer tx.Rollback()
188
+
189
+// Use the transaction...
190
+_, err = tx.CreateBucket([]byte("MyBucket"))
191
+if err != nil {
192
+    return err
193
+}
194
+
195
+// Commit the transaction and check for error.
196
+if err := tx.Commit(); err != nil {
197
+    return err
198
+}
199
+```
200
+
201
+The first argument to `DB.Begin()` is a boolean stating if the transaction
202
+should be writable.
203
+
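A complementary sketch (hypothetical, assuming a bucket named "MyBucket" exists): a manually managed read-only transaction starts with `Begin(false)` and is released with `Rollback()`:

```go
// Start a read-only transaction.
tx, err := db.Begin(false)
if err != nil {
    return err
}
// Read-only transactions are closed with Rollback(); Commit() would
// return an error because the transaction is not writable.
defer tx.Rollback()

v := tx.Bucket([]byte("MyBucket")).Get([]byte("key"))
fmt.Printf("value: %s\n", v)
```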
204
+
129 205
 ### Using buckets
130 206
 
131 207
 Buckets are collections of key/value pairs within the database. All keys in a
... ...
@@ -175,13 +256,61 @@ db.View(func(tx *bolt.Tx) error {
175 175
 ```
176 176
 
177 177
 The `Get()` function does not return an error because its operation is
178
-guarenteed to work (unless there is some kind of system failure). If the key
178
+guaranteed to work (unless there is some kind of system failure). If the key
179 179
 exists then it will return its byte slice value. If it doesn't exist then it
180 180
 will return `nil`. It's important to note that you can have a zero-length value
181 181
 set to a key which is different than the key not existing.
182 182
 
183 183
 Use the `Bucket.Delete()` function to delete a key from the bucket.
184 184
 
185
+Please note that values returned from `Get()` are only valid while the
186
+transaction is open. If you need to use a value outside of the transaction
187
+then you must use `copy()` to copy it to another byte slice.
188
+
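For instance (an illustrative sketch; the bucket and key names are assumed), copying a value out of the transaction looks like:

```go
var valCopy []byte
err := db.View(func(tx *bolt.Tx) error {
	v := tx.Bucket([]byte("MyBucket")).Get([]byte("key"))
	// v points into the memory-mapped file and becomes invalid once the
	// transaction closes, so copy it into a fresh slice.
	valCopy = make([]byte, len(v))
	copy(valCopy, v)
	return nil
})
```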
189
+
190
+### Autoincrementing integer for the bucket
191
+By using the `NextSequence()` function, you can let Bolt determine a sequence
192
+which can be used as the unique identifier for your key/value pairs. See the
193
+example below.
194
+
195
+```go
196
+// CreateUser saves u to the store. The new user ID is set on u once the data is persisted.
197
+func (s *Store) CreateUser(u *User) error {
198
+    return s.db.Update(func(tx *bolt.Tx) error {
199
+        // Retrieve the users bucket.
200
+        // This should be created when the DB is first opened.
201
+        b := tx.Bucket([]byte("users"))
202
+
203
+        // Generate ID for the user.
204
+        // This returns an error only if the Tx is closed or not writeable.
205
+        // That can't happen in an Update() call so I ignore the error check.
206
+        id, _ := b.NextSequence()
207
+        u.ID = int(id)
208
+
209
+        // Marshal user data into bytes.
210
+        buf, err := json.Marshal(u)
211
+        if err != nil {
212
+            return err
213
+        }
214
+
215
+        // Persist bytes to users bucket.
216
+        return b.Put(itob(u.ID), buf)
217
+    })
218
+}
219
+
220
+// itob returns an 8-byte big endian representation of v.
221
+func itob(v int) []byte {
222
+    b := make([]byte, 8)
223
+    binary.BigEndian.PutUint64(b, uint64(v))
224
+    return b
225
+}
226
+
227
+type User struct {
228
+    ID int
229
+    ...
230
+}
231
+
232
+```
185 233
 
186 234
 ### Iterating over keys
187 235
 
... ...
@@ -254,7 +383,7 @@ db.View(func(tx *bolt.Tx) error {
254 254
 	max := []byte("2000-01-01T00:00:00Z")
255 255
 
256 256
 	// Iterate over the 90's.
257
-	for k, v := c.Seek(min); k != nil && bytes.Compare(k, max) != -1; k, v = c.Next() {
257
+	for k, v := c.Seek(min); k != nil && bytes.Compare(k, max) <= 0; k, v = c.Next() {
258 258
 		fmt.Printf("%s: %s\n", k, v)
259 259
 	}
260 260
 
... ...
@@ -294,7 +423,7 @@ func (*Bucket) DeleteBucket(key []byte) error
294 294
 
295 295
 ### Database backups
296 296
 
297
-Bolt is a single file so it's easy to backup. You can use the `Tx.Copy()`
297
+Bolt is a single file so it's easy to backup. You can use the `Tx.WriteTo()`
298 298
 function to write a consistent view of the database to a writer. If you call
299 299
 this from a read-only transaction, it will perform a hot backup and not block
300 300
 your other database reads and writes. It will also use `O_DIRECT` when available
... ...
@@ -305,11 +434,12 @@ do database backups:
305 305
 
306 306
 ```go
307 307
 func BackupHandleFunc(w http.ResponseWriter, req *http.Request) {
308
-	err := db.View(func(tx bolt.Tx) error {
308
+	err := db.View(func(tx *bolt.Tx) error {
309 309
 		w.Header().Set("Content-Type", "application/octet-stream")
310 310
 		w.Header().Set("Content-Disposition", `attachment; filename="my.db"`)
311 311
 		w.Header().Set("Content-Length", strconv.Itoa(int(tx.Size())))
312
-		return tx.Copy(w)
312
+		_, err := tx.WriteTo(w)
313
+		return err
313 314
 	})
314 315
 	if err != nil {
315 316
 		http.Error(w, err.Error(), http.StatusInternalServerError)
... ...
@@ -351,14 +481,13 @@ go func() {
351 351
 		// Grab the current stats and diff them.
352 352
 		stats := db.Stats()
353 353
 		diff := stats.Sub(&prev)
354
-		
354
+
355 355
 		// Encode stats to JSON and print to STDERR.
356 356
 		json.NewEncoder(os.Stderr).Encode(diff)
357 357
 
358 358
 		// Save stats for the next loop.
359 359
 		prev = stats
360 360
 	}
361
-}
362 361
 }()
363 362
 ```
364 363
 
... ...
@@ -366,25 +495,83 @@ It's also useful to pipe these stats to a service such as statsd for monitoring
366 366
 or to provide an HTTP endpoint that will perform a fixed-length sample.
367 367
 
368 368
 
369
+### Read-Only Mode
370
+
371
+Sometimes it is useful to create a shared, read-only Bolt database. To do this,
372
+set the `Options.ReadOnly` flag when opening your database. Read-only mode
373
+uses a shared lock to allow multiple processes to read from the database but
374
+it will block any processes from opening the database in read-write mode.
375
+
376
+```go
377
+db, err := bolt.Open("my.db", 0666, &bolt.Options{ReadOnly: true})
378
+if err != nil {
379
+	log.Fatal(err)
380
+}
381
+```
382
+
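With `ReadOnly` set, write attempts fail fast. A hypothetical check against the `ErrDatabaseReadOnly` error introduced later in this change:

```go
err = db.Update(func(tx *bolt.Tx) error { return nil })
if err == bolt.ErrDatabaseReadOnly {
	log.Println("database was opened in read-only mode")
}
```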
383
+
369 384
 ## Resources
370 385
 
371 386
 For more information on getting started with Bolt, check out the following articles:
372 387
 
373 388
 * [Intro to BoltDB: Painless Performant Persistence](http://npf.io/2014/07/intro-to-boltdb-painless-performant-persistence/) by [Nate Finch](https://github.com/natefinch).
389
+* [Bolt -- an embedded key/value database for Go](https://www.progville.com/go/bolt-embedded-db-golang/) by Progville
390
+
391
+
392
+## Comparison with other databases
393
+
394
+### Postgres, MySQL, & other relational databases
395
+
396
+Relational databases structure data into rows and are only accessible through
397
+the use of SQL. This approach provides flexibility in how you store and query
398
+your data but also incurs overhead in parsing and planning SQL statements. Bolt
399
+accesses all data by a byte slice key. This makes Bolt fast to read and write
400
+data by key but provides no built-in support for joining values together.
401
+
402
+Most relational databases (with the exception of SQLite) are standalone servers
403
+that run separately from your application. This gives your systems
404
+flexibility to connect multiple application servers to a single database
405
+server but also adds overhead in serializing and transporting data over the
406
+network. Bolt runs as a library included in your application so all data access
407
+has to go through your application's process. This brings data closer to your
408
+application but limits multi-process access to the data.
409
+
410
+
411
+### LevelDB, RocksDB
374 412
 
413
+LevelDB and its derivatives (RocksDB, HyperLevelDB) are similar to Bolt in that
414
+they are libraries bundled into the application; however, their underlying
415
+structure is a log-structured merge-tree (LSM tree). An LSM tree optimizes
416
+random writes by using a write-ahead log and multi-tiered, sorted files called
417
+SSTables. Bolt uses a B+tree internally and only a single file. Both approaches
418
+have trade-offs.
375 419
 
420
+If you require a high random write throughput (>10,000 w/sec) or you need to use
421
+spinning disks then LevelDB could be a good choice. If your application is
422
+read-heavy or does a lot of range scans then Bolt could be a good choice.
376 423
 
377
-## Comparing Bolt to LMDB
424
+One other important consideration is that LevelDB does not have transactions.
425
+It supports batch writing of key/value pairs and it supports read snapshots
426
+but it will not give you the ability to do a compare-and-swap operation safely.
427
+Bolt supports fully serializable ACID transactions.
428
+
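As an illustration of that last point (a hypothetical sketch; bucket and key names are assumed), a compare-and-swap in Bolt is simply a read-modify-write inside a single `Update` transaction:

```go
err := db.Update(func(tx *bolt.Tx) error {
	b := tx.Bucket([]byte("config"))
	if !bytes.Equal(b.Get([]byte("owner")), []byte("alice")) {
		return errors.New("owner changed") // CAS failed; nothing is written
	}
	return b.Put([]byte("owner"), []byte("bob"))
})
```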
429
+
430
+### LMDB
378 431
 
379 432
 Bolt was originally a port of LMDB so it is architecturally similar. Both use
380
-a B+tree, have ACID semanetics with fully serializable transactions, and support
433
+a B+tree, have ACID semantics with fully serializable transactions, and support
381 434
 lock-free MVCC using a single writer and multiple readers.
382 435
 
383 436
 The two projects have somewhat diverged. LMDB heavily focuses on raw performance
384 437
 while Bolt has focused on simplicity and ease of use. For example, LMDB allows
385
-several unsafe actions such as direct writes and append writes for the sake of
386
-performance. Bolt opts to disallow actions which can leave the database in a 
387
-corrupted state. The only exception to this in Bolt is `DB.NoSync`.
438
+several unsafe actions such as direct writes for the sake of performance. Bolt
439
+opts to disallow actions which can leave the database in a corrupted state. The
440
+only exception to this in Bolt is `DB.NoSync`.
441
+
442
+There are also a few differences in API. LMDB requires a maximum mmap size when
443
+opening an `mdb_env` whereas Bolt will handle incremental mmap resizing
444
+automatically. LMDB overloads the getter and setter functions with multiple
445
+flags whereas Bolt splits these specialized cases into their own functions.
388 446
 
389 447
 
390 448
 ## Caveats & Limitations
... ...
@@ -425,14 +612,33 @@ Here are a few things to note when evaluating and using Bolt:
425 425
   can in memory and will release memory as needed to other processes. This means
426 426
   that Bolt can show very high memory usage when working with large databases.
427 427
   However, this is expected and the OS will release memory as needed. Bolt can
428
-  handle databases much larger than the available physical RAM.
428
+  handle databases much larger than the available physical RAM, provided its
429
+  memory-map fits in the process virtual address space. It may be problematic
430
+  on 32-bit systems.
431
+
432
+* The data structures in the Bolt database are memory mapped so the data file
433
+  will be endian specific. This means that you cannot copy a Bolt file from a
434
+  little endian machine to a big endian machine and have it work. For most 
435
+  users this is not a concern since most modern CPUs are little endian.
436
+
437
+* Because of the way pages are laid out on disk, Bolt cannot truncate data files
438
+  and return free pages back to the disk. Instead, Bolt maintains a free list
439
+  of unused pages within its data file. These free pages can be reused by later
440
+  transactions. This works well for many use cases as databases generally tend
441
+  to grow. However, it's important to note that deleting large chunks of data
442
+  will not allow you to reclaim that space on disk.
443
+
444
+  For more information on page allocation, [see this comment][page-allocation].
445
+
446
+[page-allocation]: https://github.com/boltdb/bolt/issues/308#issuecomment-74811638
429 447
 
430 448
 
431 449
 ## Other Projects Using Bolt
432 450
 
433 451
 Below is a list of public, open source projects that use Bolt:
434 452
 
435
-* [Bazil](https://github.com/bazillion/bazil) - A file system that lets your data reside where it is most convenient for it to reside.
453
+* [Operation Go: A Routine Mission](http://gocode.io) - An online programming game for Golang using Bolt for user accounts and a leaderboard.
454
+* [Bazil](https://bazil.org/) - A file system that lets your data reside where it is most convenient for it to reside.
436 455
 * [DVID](https://github.com/janelia-flyem/dvid) - Added Bolt as optional storage engine and testing it against Basho-tuned leveldb.
437 456
 * [Skybox Analytics](https://github.com/skybox/skybox) - A standalone funnel analysis tool for web analytics.
438 457
 * [Scuttlebutt](https://github.com/benbjohnson/scuttlebutt) - Uses Bolt to store and process all Twitter mentions of GitHub projects.
... ...
@@ -450,6 +656,16 @@ Below is a list of public, open source projects that use Bolt:
450 450
 * [bleve](http://www.blevesearch.com/) - A pure Go search engine similar to ElasticSearch that uses Bolt as the default storage backend.
451 451
 * [tentacool](https://github.com/optiflows/tentacool) - REST api server to manage system stuff (IP, DNS, Gateway...) on a linux server.
452 452
 * [SkyDB](https://github.com/skydb/sky) - Behavioral analytics database.
453
+* [Seaweed File System](https://github.com/chrislusf/weed-fs) - Highly scalable distributed key~file system with O(1) disk read.
454
+* [InfluxDB](http://influxdb.com) - Scalable datastore for metrics, events, and real-time analytics.
455
+* [Freehold](http://tshannon.bitbucket.org/freehold/) - An open, secure, and lightweight platform for your files and data.
456
+* [Prometheus Annotation Server](https://github.com/oliver006/prom_annotation_server) - Annotation server for PromDash & Prometheus service monitoring system.
457
+* [Consul](https://github.com/hashicorp/consul) - Consul is service discovery and configuration made easy. Distributed, highly available, and datacenter-aware.
458
+* [Kala](https://github.com/ajvb/kala) - Kala is a modern job scheduler optimized to run on a single node. It is persistent, provides a JSON-over-HTTP API, supports ISO 8601 duration notation, and handles dependent jobs.
459
+* [drive](https://github.com/odeke-em/drive) - drive is an unofficial Google Drive command line client for \*NIX operating systems.
460
+* [stow](https://github.com/djherbis/stow) - A persistence manager for objects
461
+  backed by boltdb.
462
+* [buckets](https://github.com/joyrexus/buckets) - A Bolt wrapper streamlining
463
+  simple tx and key scans.
453 464
 
454 465
 If you are using Bolt in a project please send a pull request to add it to the list.
455
-
456 466
new file mode 100644
... ...
@@ -0,0 +1,138 @@
0
+package bolt
1
+
2
+import (
3
+	"errors"
4
+	"fmt"
5
+	"sync"
6
+	"time"
7
+)
8
+
9
+// Batch calls fn as part of a batch. It behaves similar to Update,
10
+// except:
11
+//
12
+// 1. concurrent Batch calls can be combined into a single Bolt
13
+// transaction.
14
+//
15
+// 2. the function passed to Batch may be called multiple times,
16
+// regardless of whether it returns error or not.
17
+//
18
+// This means that Batch function side effects must be idempotent and
19
+// take permanent effect only after a successful return is seen by the
20
+// caller.
21
+//
22
+// The maximum batch size and delay can be adjusted with DB.MaxBatchSize
23
+// and DB.MaxBatchDelay, respectively.
24
+//
25
+// Batch is only useful when there are multiple goroutines calling it.
26
+func (db *DB) Batch(fn func(*Tx) error) error {
27
+	errCh := make(chan error, 1)
28
+
29
+	db.batchMu.Lock()
30
+	if (db.batch == nil) || (db.batch != nil && len(db.batch.calls) >= db.MaxBatchSize) {
31
+		// There is no existing batch, or the existing batch is full; start a new one.
32
+		db.batch = &batch{
33
+			db: db,
34
+		}
35
+		db.batch.timer = time.AfterFunc(db.MaxBatchDelay, db.batch.trigger)
36
+	}
37
+	db.batch.calls = append(db.batch.calls, call{fn: fn, err: errCh})
38
+	if len(db.batch.calls) >= db.MaxBatchSize {
39
+		// wake up batch, it's ready to run
40
+		go db.batch.trigger()
41
+	}
42
+	db.batchMu.Unlock()
43
+
44
+	err := <-errCh
45
+	if err == trySolo {
46
+		err = db.Update(fn)
47
+	}
48
+	return err
49
+}
50
+
51
+type call struct {
52
+	fn  func(*Tx) error
53
+	err chan<- error
54
+}
55
+
56
+type batch struct {
57
+	db    *DB
58
+	timer *time.Timer
59
+	start sync.Once
60
+	calls []call
61
+}
62
+
63
+// trigger runs the batch if it hasn't already been run.
64
+func (b *batch) trigger() {
65
+	b.start.Do(b.run)
66
+}
67
+
68
+// run performs the transactions in the batch and communicates results
69
+// back to DB.Batch.
70
+func (b *batch) run() {
71
+	b.db.batchMu.Lock()
72
+	b.timer.Stop()
73
+	// Make sure no new work is added to this batch, but don't break
74
+	// other batches.
75
+	if b.db.batch == b {
76
+		b.db.batch = nil
77
+	}
78
+	b.db.batchMu.Unlock()
79
+
80
+retry:
81
+	for len(b.calls) > 0 {
82
+		var failIdx = -1
83
+		err := b.db.Update(func(tx *Tx) error {
84
+			for i, c := range b.calls {
85
+				if err := safelyCall(c.fn, tx); err != nil {
86
+					failIdx = i
87
+					return err
88
+				}
89
+			}
90
+			return nil
91
+		})
92
+
93
+		if failIdx >= 0 {
94
+			// take the failing transaction out of the batch. it's
95
+			// safe to shorten b.calls here because db.batch no longer
96
+			// points to us, and we hold the mutex anyway.
97
+			c := b.calls[failIdx]
98
+			b.calls[failIdx], b.calls = b.calls[len(b.calls)-1], b.calls[:len(b.calls)-1]
99
+			// tell the submitter to re-run it solo, and continue with the rest of the batch
100
+			c.err <- trySolo
101
+			continue retry
102
+		}
103
+
104
+		// pass success, or bolt internal errors, to all callers
105
+		for _, c := range b.calls {
106
+			if c.err != nil {
107
+				c.err <- err
108
+			}
109
+		}
110
+		break retry
111
+	}
112
+}
113
+
114
+// trySolo is a special sentinel error value used for signaling that a
115
+// transaction function should be re-run. It should never be seen by
116
+// callers.
117
+var trySolo = errors.New("batch function returned an error and should be re-run solo")
118
+
119
+type panicked struct {
120
+	reason interface{}
121
+}
122
+
123
+func (p panicked) Error() string {
124
+	if err, ok := p.reason.(error); ok {
125
+		return err.Error()
126
+	}
127
+	return fmt.Sprintf("panic: %v", p.reason)
128
+}
129
+
130
+func safelyCall(fn func(*Tx) error, tx *Tx) (err error) {
131
+	defer func() {
132
+		if p := recover(); p != nil {
133
+			err = panicked{p}
134
+		}
135
+	}()
136
+	return fn(tx)
137
+}
... ...
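A short, hypothetical example of tuning the two knobs this file introduces (`DB.MaxBatchSize` and `DB.MaxBatchDelay`); the values are illustrative, not recommendations:

```go
db, err := bolt.Open("my.db", 0600, nil)
if err != nil {
	log.Fatal(err)
}
// Allow larger combined batches, but cap the extra latency per call.
db.MaxBatchSize = 500
db.MaxBatchDelay = 5 * time.Millisecond
```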
@@ -1,4 +1,7 @@
1 1
 package bolt
2 2
 
3 3
 // maxMapSize represents the largest mmap size supported by Bolt.
4
-const maxMapSize = 0xFFFFFFF // 256MB
4
+const maxMapSize = 0x7FFFFFFF // 2GB
5
+
6
+// maxAllocSize is the size used when creating array pointers.
7
+const maxAllocSize = 0xFFFFFFF
... ...
@@ -2,3 +2,6 @@ package bolt
2 2
 
3 3
 // maxMapSize represents the largest mmap size supported by Bolt.
4 4
 const maxMapSize = 0xFFFFFFFFFFFF // 256TB
5
+
6
+// maxAllocSize is the size used when creating array pointers.
7
+const maxAllocSize = 0x7FFFFFFF
... ...
@@ -1,4 +1,7 @@
1 1
 package bolt
2 2
 
3 3
 // maxMapSize represents the largest mmap size supported by Bolt.
4
-const maxMapSize = 0xFFFFFFF // 256MB
4
+const maxMapSize = 0x7FFFFFFF // 2GB
5
+
6
+// maxAllocSize is the size used when creating array pointers.
7
+const maxAllocSize = 0xFFFFFFF
5 8
new file mode 100644
... ...
@@ -0,0 +1,9 @@
0
+// +build arm64
1
+
2
+package bolt
3
+
4
+// maxMapSize represents the largest mmap size supported by Bolt.
5
+const maxMapSize = 0xFFFFFFFFFFFF // 256TB
6
+
7
+// maxAllocSize is the size used when creating array pointers.
8
+const maxAllocSize = 0x7FFFFFFF
0 9
new file mode 100644
... ...
@@ -0,0 +1,9 @@
0
+// +build ppc64le
1
+
2
+package bolt
3
+
4
+// maxMapSize represents the largest mmap size supported by Bolt.
5
+const maxMapSize = 0xFFFFFFFFFFFF // 256TB
6
+
7
+// maxAllocSize is the size used when creating array pointers.
8
+const maxAllocSize = 0x7FFFFFFF
0 9
new file mode 100644
... ...
@@ -0,0 +1,9 @@
0
+// +build s390x
1
+
2
+package bolt
3
+
4
+// maxMapSize represents the largest mmap size supported by Bolt.
5
+const maxMapSize = 0xFFFFFFFFFFFF // 256TB
6
+
7
+// maxAllocSize is the size used when creating array pointers.
8
+const maxAllocSize = 0x7FFFFFFF
... ...
@@ -1,8 +1,9 @@
1
-// +build !windows,!plan9
1
+// +build !windows,!plan9,!solaris
2 2
 
3 3
 package bolt
4 4
 
5 5
 import (
6
+	"fmt"
6 7
 	"os"
7 8
 	"syscall"
8 9
 	"time"
... ...
@@ -10,7 +11,7 @@ import (
10 10
 )
11 11
 
12 12
 // flock acquires an advisory lock on a file descriptor.
13
-func flock(f *os.File, timeout time.Duration) error {
13
+func flock(f *os.File, exclusive bool, timeout time.Duration) error {
14 14
 	var t time.Time
15 15
 	for {
16 16
 		// If we're beyond our timeout then return an error.
... ...
@@ -20,9 +21,13 @@ func flock(f *os.File, timeout time.Duration) error {
20 20
 		} else if timeout > 0 && time.Since(t) > timeout {
21 21
 			return ErrTimeout
22 22
 		}
23
+		flag := syscall.LOCK_SH
24
+		if exclusive {
25
+			flag = syscall.LOCK_EX
26
+		}
23 27
 
24 28
 		// Otherwise attempt to obtain an exclusive lock.
25
-		err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX|syscall.LOCK_NB)
29
+		err := syscall.Flock(int(f.Fd()), flag|syscall.LOCK_NB)
26 30
 		if err == nil {
27 31
 			return nil
28 32
 		} else if err != syscall.EWOULDBLOCK {
... ...
@@ -41,11 +46,28 @@ func funlock(f *os.File) error {
41 41
 
42 42
 // mmap memory maps a DB's data file.
43 43
 func mmap(db *DB, sz int) error {
44
+	// Truncate and fsync to ensure file size metadata is flushed.
45
+	// https://github.com/boltdb/bolt/issues/284
46
+	if !db.NoGrowSync && !db.readOnly {
47
+		if err := db.file.Truncate(int64(sz)); err != nil {
48
+			return fmt.Errorf("file resize error: %s", err)
49
+		}
50
+		if err := db.file.Sync(); err != nil {
51
+			return fmt.Errorf("file sync error: %s", err)
52
+		}
53
+	}
54
+
55
+	// Map the data file to memory.
44 56
 	b, err := syscall.Mmap(int(db.file.Fd()), 0, sz, syscall.PROT_READ, syscall.MAP_SHARED)
45 57
 	if err != nil {
46 58
 		return err
47 59
 	}
48 60
 
61
+	// Advise the kernel that the mmap is accessed randomly.
62
+	if err := madvise(b, syscall.MADV_RANDOM); err != nil {
63
+		return fmt.Errorf("madvise: %s", err)
64
+	}
65
+
49 66
 	// Save the original byte slice and convert to a byte array pointer.
50 67
 	db.dataref = b
51 68
 	db.data = (*[maxMapSize]byte)(unsafe.Pointer(&b[0]))
... ...
@@ -67,3 +89,12 @@ func munmap(db *DB) error {
67 67
 	db.datasz = 0
68 68
 	return err
69 69
 }
70
+
71
+// NOTE: This function is copied from stdlib because it is not available on darwin.
72
+func madvise(b []byte, advice int) (err error) {
73
+	_, _, e1 := syscall.Syscall(syscall.SYS_MADVISE, uintptr(unsafe.Pointer(&b[0])), uintptr(len(b)), uintptr(advice))
74
+	if e1 != 0 {
75
+		err = e1
76
+	}
77
+	return
78
+}
70 79
new file mode 100644
... ...
@@ -0,0 +1,101 @@
0
+
1
+package bolt
2
+
3
+import (
4
+	"fmt"
5
+	"os"
6
+	"syscall"
7
+	"time"
8
+	"unsafe"
9
+	"golang.org/x/sys/unix"
10
+)
11
+
12
+// flock acquires an advisory lock on a file descriptor.
13
+func flock(f *os.File, exclusive bool, timeout time.Duration) error {
14
+	var t time.Time
15
+	for {
16
+		// If we're beyond our timeout then return an error.
17
+		// This can only occur after we've attempted a flock once.
18
+		if t.IsZero() {
19
+			t = time.Now()
20
+		} else if timeout > 0 && time.Since(t) > timeout {
21
+			return ErrTimeout
22
+		}
23
+		var lock syscall.Flock_t
24
+		lock.Start = 0
25
+		lock.Len = 0
26
+		lock.Pid = 0
27
+		lock.Whence = 0
29
+		if exclusive {
30
+			lock.Type = syscall.F_WRLCK
31
+		} else {
32
+			lock.Type = syscall.F_RDLCK
33
+		}
34
+		err := syscall.FcntlFlock(f.Fd(), syscall.F_SETLK, &lock)
35
+		if err == nil {
36
+			return nil
37
+		} else if err != syscall.EAGAIN {
38
+			return err
39
+		}
40
+
41
+		// Wait for a bit and try again.
42
+		time.Sleep(50 * time.Millisecond)
43
+	}
44
+}
45
+
46
+// funlock releases an advisory lock on a file descriptor.
47
+func funlock(f *os.File) error {
48
+	var lock syscall.Flock_t
49
+	lock.Start = 0
50
+	lock.Len = 0
51
+	lock.Type = syscall.F_UNLCK
52
+	lock.Whence = 0
53
+	return syscall.FcntlFlock(uintptr(f.Fd()), syscall.F_SETLK, &lock)
54
+}
55
+
56
+// mmap memory maps a DB's data file.
57
+func mmap(db *DB, sz int) error {
58
+	// Truncate and fsync to ensure file size metadata is flushed.
59
+	// https://github.com/boltdb/bolt/issues/284
60
+	if !db.NoGrowSync && !db.readOnly {
61
+		if err := db.file.Truncate(int64(sz)); err != nil {
62
+			return fmt.Errorf("file resize error: %s", err)
63
+		}
64
+		if err := db.file.Sync(); err != nil {
65
+			return fmt.Errorf("file sync error: %s", err)
66
+		}
67
+	}
68
+
69
+	// Map the data file to memory.
70
+	b, err := unix.Mmap(int(db.file.Fd()), 0, sz, syscall.PROT_READ, syscall.MAP_SHARED)
71
+	if err != nil {
72
+		return err
73
+	}
74
+
75
+	// Advise the kernel that the mmap is accessed randomly.
76
+	if err := unix.Madvise(b, syscall.MADV_RANDOM); err != nil {
77
+		return fmt.Errorf("madvise: %s", err)
78
+	}
79
+
80
+	// Save the original byte slice and convert to a byte array pointer.
81
+	db.dataref = b
82
+	db.data = (*[maxMapSize]byte)(unsafe.Pointer(&b[0]))
83
+	db.datasz = sz
84
+	return nil
85
+}
86
+
87
+// munmap unmaps a DB's data file from memory.
88
+func munmap(db *DB) error {
89
+	// Ignore the unmap if we have no mapped data.
90
+	if db.dataref == nil {
91
+		return nil
92
+	}
93
+
94
+	// Unmap using the original byte slice.
95
+	err := unix.Munmap(db.dataref)
96
+	db.dataref = nil
97
+	db.data = nil
98
+	db.datasz = 0
99
+	return err
100
+}
... ...
@@ -16,7 +16,7 @@ func fdatasync(db *DB) error {
16 16
 }
17 17
 
18 18
 // flock acquires an advisory lock on a file descriptor.
19
-func flock(f *os.File, _ time.Duration) error {
19
+func flock(f *os.File, _ bool, _ time.Duration) error {
20 20
 	return nil
21 21
 }
22 22
 
... ...
@@ -28,9 +28,11 @@ func funlock(f *os.File) error {
28 28
 // mmap memory maps a DB's data file.
29 29
 // Based on: https://github.com/edsrzf/mmap-go
30 30
 func mmap(db *DB, sz int) error {
31
-	// Truncate the database to the size of the mmap.
32
-	if err := db.file.Truncate(int64(sz)); err != nil {
33
-		return fmt.Errorf("truncate: %s", err)
31
+	if !db.readOnly {
32
+		// Truncate the database to the size of the mmap.
33
+		if err := db.file.Truncate(int64(sz)); err != nil {
34
+			return fmt.Errorf("truncate: %s", err)
35
+		}
34 36
 	}
35 37
 
36 38
 	// Open a file mapping handle.
... ...
@@ -99,6 +99,7 @@ func (b *Bucket) Cursor() *Cursor {
99 99
 
100 100
 // Bucket retrieves a nested bucket by name.
101 101
 // Returns nil if the bucket does not exist.
102
+// The bucket instance is only valid for the lifetime of the transaction.
102 103
 func (b *Bucket) Bucket(name []byte) *Bucket {
103 104
 	if b.buckets != nil {
104 105
 		if child := b.buckets[string(name)]; child != nil {
... ...
@@ -148,6 +149,7 @@ func (b *Bucket) openBucket(value []byte) *Bucket {
148 148
 
149 149
 // CreateBucket creates a new bucket at the given key and returns the new bucket.
150 150
 // Returns an error if the key already exists, if the bucket name is blank, or if the bucket name is too long.
151
+// The bucket instance is only valid for the lifetime of the transaction.
151 152
 func (b *Bucket) CreateBucket(key []byte) (*Bucket, error) {
152 153
 	if b.tx.db == nil {
153 154
 		return nil, ErrTxClosed
... ...
@@ -192,6 +194,7 @@ func (b *Bucket) CreateBucket(key []byte) (*Bucket, error) {
192 192
 
193 193
 // CreateBucketIfNotExists creates a new bucket if it doesn't already exist and returns a reference to it.
194 194
 // Returns an error if the bucket name is blank, or if the bucket name is too long.
195
+// The bucket instance is only valid for the lifetime of the transaction.
195 196
 func (b *Bucket) CreateBucketIfNotExists(key []byte) (*Bucket, error) {
196 197
 	child, err := b.CreateBucket(key)
197 198
 	if err == ErrBucketExists {
... ...
@@ -252,6 +255,7 @@ func (b *Bucket) DeleteBucket(key []byte) error {
252 252
 
253 253
 // Get retrieves the value for a key in the bucket.
254 254
 // Returns a nil value if the key does not exist or if the key is a nested bucket.
255
+// The returned value is only valid for the life of the transaction.
255 256
 func (b *Bucket) Get(key []byte) []byte {
256 257
 	k, v, flags := b.Cursor().seek(key)
257 258
 
... ...
@@ -332,6 +336,12 @@ func (b *Bucket) NextSequence() (uint64, error) {
332 332
 		return 0, ErrTxNotWritable
333 333
 	}
334 334
 
335
+	// Materialize the root node if it hasn't been already so that the
336
+	// bucket will be saved during commit.
337
+	if b.rootNode == nil {
338
+		_ = b.node(b.root, nil)
339
+	}
340
+
335 341
 	// Increment and return the sequence.
336 342
 	b.bucket.sequence++
337 343
 	return b.bucket.sequence, nil
... ...
@@ -339,7 +349,8 @@ func (b *Bucket) NextSequence() (uint64, error) {
339 339
 
340 340
 // ForEach executes a function for each key/value pair in a bucket.
341 341
 // If the provided function returns an error then the iteration is stopped and
342
-// the error is returned to the caller.
342
+// the error is returned to the caller. The provided function must not modify
343
+// the bucket; this will result in undefined behavior.
343 344
 func (b *Bucket) ForEach(fn func(k, v []byte) error) error {
344 345
 	if b.tx.db == nil {
345 346
 		return ErrTxClosed
... ...
@@ -511,8 +522,12 @@ func (b *Bucket) spill() error {
511 511
 		// Update parent node.
512 512
 		var c = b.Cursor()
513 513
 		k, _, flags := c.seek([]byte(name))
514
-		_assert(bytes.Equal([]byte(name), k), "misplaced bucket header: %x -> %x", []byte(name), k)
515
-		_assert(flags&bucketLeafFlag != 0, "unexpected bucket header flag: %x", flags)
514
+		if !bytes.Equal([]byte(name), k) {
515
+			panic(fmt.Sprintf("misplaced bucket header: %x -> %x", []byte(name), k))
516
+		}
517
+		if flags&bucketLeafFlag == 0 {
518
+			panic(fmt.Sprintf("unexpected bucket header flag: %x", flags))
519
+		}
516 520
 		c.node().put([]byte(name), []byte(name), value, 0, bucketLeafFlag)
517 521
 	}
518 522
 
... ...
@@ -528,7 +543,9 @@ func (b *Bucket) spill() error {
528 528
 	b.rootNode = b.rootNode.root()
529 529
 
530 530
 	// Update the root node for this bucket.
531
-	_assert(b.rootNode.pgid < b.tx.meta.pgid, "pgid (%d) above high water mark (%d)", b.rootNode.pgid, b.tx.meta.pgid)
531
+	if b.rootNode.pgid >= b.tx.meta.pgid {
532
+		panic(fmt.Sprintf("pgid (%d) above high water mark (%d)", b.rootNode.pgid, b.tx.meta.pgid))
533
+	}
532 534
 	b.root = b.rootNode.pgid
533 535
 
534 536
 	return nil
... ...
@@ -659,7 +676,9 @@ func (b *Bucket) pageNode(id pgid) (*page, *node) {
659 659
 	// Inline buckets have a fake page embedded in their value so treat them
660 660
 	// differently. We'll return the rootNode (if available) or the fake page.
661 661
 	if b.root == 0 {
662
-		_assert(id == 0, "inline bucket non-zero page access(2): %d != 0", id)
662
+		if id != 0 {
663
+			panic(fmt.Sprintf("inline bucket non-zero page access(2): %d != 0", id))
664
+		}
663 665
 		if b.rootNode != nil {
664 666
 			return nil, b.rootNode
665 667
 		}
... ...
@@ -2,6 +2,7 @@ package bolt
2 2
 
3 3
 import (
4 4
 	"bytes"
5
+	"fmt"
5 6
 	"sort"
6 7
 )
7 8
 
... ...
@@ -9,6 +10,8 @@ import (
9 9
 // Cursors see nested buckets with value == nil.
10 10
 // Cursors can be obtained from a transaction and are valid as long as the transaction is open.
11 11
 //
12
+// Keys and values returned from the cursor are only valid for the life of the transaction.
13
+//
12 14
 // Changing data while traversing with a cursor may cause it to be invalidated
13 15
 // and return unexpected keys and/or values. You must reposition your cursor
14 16
 // after mutating data.
... ...
@@ -24,6 +27,7 @@ func (c *Cursor) Bucket() *Bucket {
24 24
 
25 25
 // First moves the cursor to the first item in the bucket and returns its key and value.
26 26
 // If the bucket is empty then a nil key and value are returned.
27
+// The returned key and value are only valid for the life of the transaction.
27 28
 func (c *Cursor) First() (key []byte, value []byte) {
28 29
 	_assert(c.bucket.tx.db != nil, "tx closed")
29 30
 	c.stack = c.stack[:0]
... ...
@@ -40,6 +44,7 @@ func (c *Cursor) First() (key []byte, value []byte) {
40 40
 
41 41
 // Last moves the cursor to the last item in the bucket and returns its key and value.
42 42
 // If the bucket is empty then a nil key and value are returned.
43
+// The returned key and value are only valid for the life of the transaction.
43 44
 func (c *Cursor) Last() (key []byte, value []byte) {
44 45
 	_assert(c.bucket.tx.db != nil, "tx closed")
45 46
 	c.stack = c.stack[:0]
... ...
@@ -57,6 +62,7 @@ func (c *Cursor) Last() (key []byte, value []byte) {
57 57
 
58 58
 // Next moves the cursor to the next item in the bucket and returns its key and value.
59 59
 // If the cursor is at the end of the bucket then a nil key and value are returned.
60
+// The returned key and value are only valid for the life of the transaction.
60 61
 func (c *Cursor) Next() (key []byte, value []byte) {
61 62
 	_assert(c.bucket.tx.db != nil, "tx closed")
62 63
 	k, v, flags := c.next()
... ...
@@ -68,6 +74,7 @@ func (c *Cursor) Next() (key []byte, value []byte) {
68 68
 
69 69
 // Prev moves the cursor to the previous item in the bucket and returns its key and value.
70 70
 // If the cursor is at the beginning of the bucket then a nil key and value are returned.
71
+// The returned key and value are only valid for the life of the transaction.
71 72
 func (c *Cursor) Prev() (key []byte, value []byte) {
72 73
 	_assert(c.bucket.tx.db != nil, "tx closed")
73 74
 
... ...
@@ -99,6 +106,7 @@ func (c *Cursor) Prev() (key []byte, value []byte) {
99 99
 // Seek moves the cursor to a given key and returns it.
100 100
 // If the key does not exist then the next key is used. If no keys
101 101
 // follow, a nil key is returned.
102
+// The returned key and value are only valid for the life of the transaction.
102 103
 func (c *Cursor) Seek(seek []byte) (key []byte, value []byte) {
103 104
 	k, v, flags := c.seek(seek)
104 105
 
... ...
@@ -228,8 +236,8 @@ func (c *Cursor) next() (key []byte, value []byte, flags uint32) {
228 228
 // search recursively performs a binary search against a given page/node until it finds a given key.
229 229
 func (c *Cursor) search(key []byte, pgid pgid) {
230 230
 	p, n := c.bucket.pageNode(pgid)
231
-	if p != nil {
232
-		_assert((p.flags&(branchPageFlag|leafPageFlag)) != 0, "invalid page type: %d: %x", p.id, p.flags)
231
+	if p != nil && (p.flags&(branchPageFlag|leafPageFlag)) == 0 {
232
+		panic(fmt.Sprintf("invalid page type: %d: %x", p.id, p.flags))
233 233
 	}
234 234
 	e := elemRef{page: p, node: n}
235 235
 	c.stack = append(c.stack, e)
... ...
@@ -12,9 +12,6 @@ import (
12 12
 	"unsafe"
13 13
 )
14 14
 
15
-// The smallest size that the mmap can be.
16
-const minMmapSize = 1 << 22 // 4MB
17
-
18 15
 // The largest step that can be taken when remapping the mmap.
19 16
 const maxMmapStep = 1 << 30 // 1GB
20 17
 
... ...
@@ -30,6 +27,12 @@ const magic uint32 = 0xED0CDAED
30 30
 // must be synchronzied using the msync(2) syscall.
31 31
 const IgnoreNoSync = runtime.GOOS == "openbsd"
32 32
 
33
+// Default values if not set in a DB instance.
34
+const (
35
+	DefaultMaxBatchSize  int = 1000
36
+	DefaultMaxBatchDelay     = 10 * time.Millisecond
37
+)
38
+
33 39
 // DB represents a collection of buckets persisted to a file on disk.
34 40
 // All data access is performed through transactions which can be obtained through the DB.
35 41
 // All the functions on DB will return a ErrDatabaseNotOpen if accessed before Open() is called.
... ...
@@ -52,9 +55,33 @@ type DB struct {
52 52
 	// THIS IS UNSAFE. PLEASE USE WITH CAUTION.
53 53
 	NoSync bool
54 54
 
55
+	// When true, skips the truncate call when growing the database.
56
+	// Setting this to true is only safe on non-ext3/ext4 systems.
57
+	// Skipping truncation avoids preallocation of hard drive space and
58
+	// bypasses a truncate() and fsync() syscall on remapping.
59
+	//
60
+	// https://github.com/boltdb/bolt/issues/284
61
+	NoGrowSync bool
62
+
63
+	// MaxBatchSize is the maximum size of a batch. Default value is
64
+	// copied from DefaultMaxBatchSize in Open.
65
+	//
66
+	// If <=0, disables batching.
67
+	//
68
+	// Do not change concurrently with calls to Batch.
69
+	MaxBatchSize int
70
+
71
+	// MaxBatchDelay is the maximum delay before a batch starts.
72
+	// Default value is copied from DefaultMaxBatchDelay in Open.
73
+	//
74
+	// If <=0, effectively disables batching.
75
+	//
76
+	// Do not change concurrently with calls to Batch.
77
+	MaxBatchDelay time.Duration
78
+
55 79
 	path     string
56 80
 	file     *os.File
57
-	dataref  []byte
81
+	dataref  []byte // mmap'ed readonly, write throws SEGV
58 82
 	data     *[maxMapSize]byte
59 83
 	datasz   int
60 84
 	meta0    *meta
... ...
@@ -66,6 +93,9 @@ type DB struct {
66 66
 	freelist *freelist
67 67
 	stats    Stats
68 68
 
69
+	batchMu sync.Mutex
70
+	batch   *batch
71
+
69 72
 	rwlock   sync.Mutex   // Allows only one writer at a time.
70 73
 	metalock sync.Mutex   // Protects meta page access.
71 74
 	mmaplock sync.RWMutex // Protects mmap access during remapping.
... ...
@@ -74,6 +104,10 @@ type DB struct {
74 74
 	ops struct {
75 75
 		writeAt func(b []byte, off int64) (n int, err error)
76 76
 	}
77
+
78
+	// Read only mode.
79
+	// When true, Update() and Begin(true) return ErrDatabaseReadOnly immediately.
80
+	readOnly bool
77 81
 }
78 82
 
79 83
 // Path returns the path to currently open database file.
... ...
@@ -101,20 +135,34 @@ func Open(path string, mode os.FileMode, options *Options) (*DB, error) {
101 101
 	if options == nil {
102 102
 		options = DefaultOptions
103 103
 	}
104
+	db.NoGrowSync = options.NoGrowSync
105
+
106
+	// Set default values for later DB operations.
107
+	db.MaxBatchSize = DefaultMaxBatchSize
108
+	db.MaxBatchDelay = DefaultMaxBatchDelay
109
+
110
+	flag := os.O_RDWR
111
+	if options.ReadOnly {
112
+		flag = os.O_RDONLY
113
+		db.readOnly = true
114
+	}
104 115
 
105 116
 	// Open data file and separate sync handler for metadata writes.
106 117
 	db.path = path
107
-
108 118
 	var err error
109
-	if db.file, err = os.OpenFile(db.path, os.O_RDWR|os.O_CREATE, mode); err != nil {
119
+	if db.file, err = os.OpenFile(db.path, flag|os.O_CREATE, mode); err != nil {
110 120
 		_ = db.close()
111 121
 		return nil, err
112 122
 	}
113 123
 
114
-	// Lock file so that other processes using Bolt cannot use the database
115
-	// at the same time. This would cause corruption since the two processes
116
-	// would write meta pages and free pages separately.
117
-	if err := flock(db.file, options.Timeout); err != nil {
124
+	// Lock file so that other processes using Bolt in read-write mode cannot
125
+	// use the database  at the same time. This would cause corruption since
126
+	// the two processes would write meta pages and free pages separately.
127
+	// The database file is locked exclusively (only one process can grab the lock)
128
+	// if !options.ReadOnly.
129
+	// The database file is locked using the shared lock (more than one process may
130
+	// hold a lock at the same time) otherwise (options.ReadOnly is set).
131
+	if err := flock(db.file, !db.readOnly, options.Timeout); err != nil {
118 132
 		_ = db.close()
119 133
 		return nil, err
120 134
 	}
... ...
@@ -162,16 +210,6 @@ func (db *DB) mmap(minsz int) error {
162 162
 	db.mmaplock.Lock()
163 163
 	defer db.mmaplock.Unlock()
164 164
 
165
-	// Dereference all mmap references before unmapping.
166
-	if db.rwtx != nil {
167
-		db.rwtx.root.dereference()
168
-	}
169
-
170
-	// Unmap existing data before continuing.
171
-	if err := db.munmap(); err != nil {
172
-		return err
173
-	}
174
-
175 165
 	info, err := db.file.Stat()
176 166
 	if err != nil {
177 167
 		return fmt.Errorf("mmap stat error: %s", err)
... ...
@@ -184,7 +222,20 @@ func (db *DB) mmap(minsz int) error {
184 184
 	if size < minsz {
185 185
 		size = minsz
186 186
 	}
187
-	size = db.mmapSize(size)
187
+	size, err = db.mmapSize(size)
188
+	if err != nil {
189
+		return err
190
+	}
191
+
192
+	// Dereference all mmap references before unmapping.
193
+	if db.rwtx != nil {
194
+		db.rwtx.root.dereference()
195
+	}
196
+
197
+	// Unmap existing data before continuing.
198
+	if err := db.munmap(); err != nil {
199
+		return err
200
+	}
188 201
 
189 202
 	// Memory-map the data file as a byte slice.
190 203
 	if err := mmap(db, size); err != nil {
... ...
@@ -215,22 +266,40 @@ func (db *DB) munmap() error {
215 215
 }
216 216
 
217 217
 // mmapSize determines the appropriate size for the mmap given the current size
218
-// of the database. The minimum size is 4MB and doubles until it reaches 1GB.
219
-func (db *DB) mmapSize(size int) int {
220
-	if size <= minMmapSize {
221
-		return minMmapSize
222
-	} else if size < maxMmapStep {
223
-		size *= 2
224
-	} else {
225
-		size += maxMmapStep
218
+// of the database. The minimum size is 32KB and doubles until it reaches 1GB.
219
+// Returns an error if the new mmap size is greater than the max allowed.
220
+func (db *DB) mmapSize(size int) (int, error) {
221
+	// Double the size from 32KB until 1GB.
222
+	for i := uint(15); i <= 30; i++ {
223
+		if size <= 1<<i {
224
+			return 1 << i, nil
225
+		}
226
+	}
227
+
228
+	// Verify the requested size is not above the maximum allowed.
229
+	if size > maxMapSize {
230
+		return 0, fmt.Errorf("mmap too large")
231
+	}
232
+
233
+	// If larger than 1GB then grow by 1GB at a time.
234
+	sz := int64(size)
235
+	if remainder := sz % int64(maxMmapStep); remainder > 0 {
236
+		sz += int64(maxMmapStep) - remainder
226 237
 	}
227 238
 
228 239
 	// Ensure that the mmap size is a multiple of the page size.
229
-	if (size % db.pageSize) != 0 {
230
-		size = ((size / db.pageSize) + 1) * db.pageSize
240
+	// This should always be true since we grow in power-of-two or 1GB steps.
241
+	pageSize := int64(db.pageSize)
242
+	if (sz % pageSize) != 0 {
243
+		sz = ((sz / pageSize) + 1) * pageSize
244
+	}
245
+
246
+	// If we've exceeded the max size then only grow up to the max size.
247
+	if sz > maxMapSize {
248
+		sz = maxMapSize
231 249
 	}
232 250
 
233
-	return size
251
+	return int(sz), nil
234 252
 }
235 253
 
236 254
 // init creates a new database file and initializes its meta pages.
... ...
@@ -250,7 +319,6 @@ func (db *DB) init() error {
250 250
 		m.magic = magic
251 251
 		m.version = version
252 252
 		m.pageSize = uint32(db.pageSize)
253
-		m.version = version
254 253
 		m.freelist = 2
255 254
 		m.root = bucket{root: 3}
256 255
 		m.pgid = 4
... ...
@@ -283,8 +351,15 @@ func (db *DB) init() error {
283 283
 // Close releases all database resources.
284 284
 // All transactions must be closed before closing the database.
285 285
 func (db *DB) Close() error {
286
+	db.rwlock.Lock()
287
+	defer db.rwlock.Unlock()
288
+
286 289
 	db.metalock.Lock()
287 290
 	defer db.metalock.Unlock()
291
+
292
+	db.mmaplock.RLock()
293
+	defer db.mmaplock.RUnlock()
294
+
288 295
 	return db.close()
289 296
 }
290 297
 
... ...
@@ -304,8 +379,11 @@ func (db *DB) close() error {
304 304
 
305 305
 	// Close file handles.
306 306
 	if db.file != nil {
307
-		// Unlock the file.
308
-		_ = funlock(db.file)
307
+		// No need to unlock read-only file.
308
+		if !db.readOnly {
309
+			// Unlock the file.
310
+			_ = funlock(db.file)
311
+		}
309 312
 
310 313
 		// Close the file descriptor.
311 314
 		if err := db.file.Close(); err != nil {
... ...
@@ -323,6 +401,11 @@ func (db *DB) close() error {
323 323
 // will cause the calls to block and be serialized until the current write
324 324
 // transaction finishes.
325 325
 //
326
+// Transactions should not be dependent on one another. Opening a read
327
+// transaction and a write transaction in the same goroutine can cause the
328
+// writer to deadlock because the database periodically needs to re-mmap itself
329
+// as it grows and it cannot do that while a read transaction is open.
330
+//
326 331
 // IMPORTANT: You must close read-only transactions after you are finished or
327 332
 // else the database will not reclaim old pages.
328 333
 func (db *DB) Begin(writable bool) (*Tx, error) {
... ...
@@ -371,6 +454,11 @@ func (db *DB) beginTx() (*Tx, error) {
371 371
 }
372 372
 
373 373
 func (db *DB) beginRWTx() (*Tx, error) {
374
+	// If the database was opened with Options.ReadOnly, return an error.
375
+	if db.readOnly {
376
+		return nil, ErrDatabaseReadOnly
377
+	}
378
+
374 379
 	// Obtain writer lock. This is released by the transaction when it closes.
375 380
 	// This enforces only one writer transaction at a time.
376 381
 	db.rwlock.Lock()
... ...
@@ -501,6 +589,12 @@ func (db *DB) View(fn func(*Tx) error) error {
501 501
 	return nil
502 502
 }
503 503
 
504
+// Sync executes fdatasync() against the database file handle.
505
+//
506
+// This is not necessary under normal operation, however, if you use NoSync
507
+// then it allows you to force the database file to sync against the disk.
508
+func (db *DB) Sync() error { return fdatasync(db) }
509
+
504 510
 // Stats retrieves ongoing performance stats for the database.
505 511
 // This is only updated when a transaction closes.
506 512
 func (db *DB) Stats() Stats {
... ...
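A hypothetical use of the new `DB.Sync()`: bulk-load with `NoSync` enabled, then force a single fsync at the end (`records` is an assumed input slice):

```go
db.NoSync = true
for _, rec := range records {
	if err := db.Update(func(tx *bolt.Tx) error {
		return tx.Bucket([]byte("data")).Put(rec.Key, rec.Value)
	}); err != nil {
		log.Fatal(err)
	}
}
if err := db.Sync(); err != nil { // flush the data file to disk once
	log.Fatal(err)
}
db.NoSync = false
```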
@@ -561,18 +655,30 @@ func (db *DB) allocate(count int) (*page, error) {
561 561
 	return p, nil
562 562
 }
563 563
 
564
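+// IsReadOnly returns whether the database was opened with Options.ReadOnly set.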
+func (db *DB) IsReadOnly() bool {
565
+	return db.readOnly
566
+}
567
+
564 568
 // Options represents the options that can be set when opening a database.
565 569
 type Options struct {
566 570
 	// Timeout is the amount of time to wait to obtain a file lock.
567 571
 	// When set to zero it will wait indefinitely. This option is only
568 572
 	// available on Darwin and Linux.
569 573
 	Timeout time.Duration
574
+
575
+	// Sets the DB.NoGrowSync flag before memory mapping the file.
576
+	NoGrowSync bool
577
+
578
+	// Open database in read-only mode. Uses flock(..., LOCK_SH |LOCK_NB) to
579
+	// grab a shared lock (UNIX).
580
+	ReadOnly bool
570 581
 }
571 582
 
572 583
 // DefaultOptions represent the options used if nil options are passed into Open().
573 584
 // No timeout is used which will cause Bolt to wait indefinitely for a lock.
574 585
 var DefaultOptions = &Options{
575
-	Timeout: 0,
586
+	Timeout:    0,
587
+	NoGrowSync: false,
576 588
 }
577 589
 
578 590
 // Stats represents statistics about the database.
... ...
@@ -647,9 +753,11 @@ func (m *meta) copy(dest *meta) {
647 647
 
648 648
 // write writes the meta onto a page.
649 649
 func (m *meta) write(p *page) {
650
-
651
-	_assert(m.root.root < m.pgid, "root bucket pgid (%d) above high water mark (%d)", m.root.root, m.pgid)
652
-	_assert(m.freelist < m.pgid, "freelist pgid (%d) above high water mark (%d)", m.freelist, m.pgid)
650
+	if m.root.root >= m.pgid {
651
+		panic(fmt.Sprintf("root bucket pgid (%d) above high water mark (%d)", m.root.root, m.pgid))
652
+	} else if m.freelist >= m.pgid {
653
+		panic(fmt.Sprintf("freelist pgid (%d) above high water mark (%d)", m.freelist, m.pgid))
654
+	}
653 655
 
654 656
 	// Page id is either going to be 0 or 1 which we can determine by the transaction ID.
655 657
 	p.id = pgid(m.txid % 2)
... ...
@@ -675,13 +783,8 @@ func _assert(condition bool, msg string, v ...interface{}) {
675 675
 	}
676 676
 }
677 677
 
678
-func warn(v ...interface{}) {
679
-	fmt.Fprintln(os.Stderr, v...)
680
-}
681
-
682
-func warnf(msg string, v ...interface{}) {
683
-	fmt.Fprintf(os.Stderr, msg+"\n", v...)
684
-}
678
+func warn(v ...interface{})              { fmt.Fprintln(os.Stderr, v...) }
679
+func warnf(msg string, v ...interface{}) { fmt.Fprintf(os.Stderr, msg+"\n", v...) }
685 680
 
686 681
 func printstack() {
687 682
 	stack := strings.Join(strings.Split(string(debug.Stack()), "\n")[2:], "\n")
... ...
@@ -36,6 +36,10 @@ var (
36 36
 	// ErrTxClosed is returned when committing or rolling back a transaction
37 37
 	// that has already been committed or rolled back.
38 38
 	ErrTxClosed = errors.New("tx closed")
39
+
40
+	// ErrDatabaseReadOnly is returned when a mutating transaction is started on a
41
+	// read-only database.
42
+	ErrDatabaseReadOnly = errors.New("database is in read-only mode")
39 43
 )
40 44
 
41 45
 // These errors can occur when putting or deleting a value or a bucket.
... ...
@@ -1,6 +1,7 @@
1 1
 package bolt
2 2
 
3 3
 import (
4
+	"fmt"
4 5
 	"sort"
5 6
 	"unsafe"
6 7
 )
... ...
@@ -47,15 +48,14 @@ func (f *freelist) pending_count() int {
47 47
 
48 48
 // all returns a list of all free ids and all pending ids in one sorted list.
49 49
 func (f *freelist) all() []pgid {
50
-	ids := make([]pgid, len(f.ids))
51
-	copy(ids, f.ids)
50
+	m := make(pgids, 0)
52 51
 
53 52
 	for _, list := range f.pending {
54
-		ids = append(ids, list...)
53
+		m = append(m, list...)
55 54
 	}
56 55
 
57
-	sort.Sort(pgids(ids))
58
-	return ids
56
+	sort.Sort(m)
57
+	return pgids(f.ids).merge(m)
59 58
 }
60 59
 
61 60
 // allocate returns the starting page id of a contiguous list of pages of a given size.
... ...
@@ -67,7 +67,9 @@ func (f *freelist) allocate(n int) pgid {
67 67
 
68 68
 	var initial, previd pgid
69 69
 	for i, id := range f.ids {
70
-		_assert(id > 1, "invalid page allocation: %d", id)
70
+		if id <= 1 {
71
+			panic(fmt.Sprintf("invalid page allocation: %d", id))
72
+		}
71 73
 
72 74
 		// Reset initial page if this is not contiguous.
73 75
 		if previd == 0 || id-previd != 1 {
... ...
@@ -103,13 +105,17 @@ func (f *freelist) allocate(n int) pgid {
103 103
 // free releases a page and its overflow for a given transaction id.
104 104
 // If the page is already free then a panic will occur.
105 105
 func (f *freelist) free(txid txid, p *page) {
106
-	_assert(p.id > 1, "cannot free page 0 or 1: %d", p.id)
106
+	if p.id <= 1 {
107
+		panic(fmt.Sprintf("cannot free page 0 or 1: %d", p.id))
108
+	}
107 109
 
108 110
 	// Free page and all its overflow pages.
109 111
 	var ids = f.pending[txid]
110 112
 	for id := p.id; id <= p.id+pgid(p.overflow); id++ {
111 113
 		// Verify that page is not already free.
112
-		_assert(!f.cache[id], "page %d already freed", id)
114
+		if f.cache[id] {
115
+			panic(fmt.Sprintf("page %d already freed", id))
116
+		}
113 117
 
114 118
 		// Add to the freelist and cache.
115 119
 		ids = append(ids, id)
... ...
@@ -120,15 +126,17 @@ func (f *freelist) free(txid txid, p *page) {
120 120
 
121 121
 // release moves all page ids for a transaction id (or older) to the freelist.
122 122
 func (f *freelist) release(txid txid) {
123
+	m := make(pgids, 0)
123 124
 	for tid, ids := range f.pending {
124 125
 		if tid <= txid {
125 126
 			// Move transaction's pending pages to the available freelist.
126 127
 			// Don't remove from the cache since the page is still free.
127
-			f.ids = append(f.ids, ids...)
128
+			m = append(m, ids...)
128 129
 			delete(f.pending, tid)
129 130
 		}
130 131
 	}
131
-	sort.Sort(pgids(f.ids))
132
+	sort.Sort(m)
133
+	f.ids = pgids(f.ids).merge(m)
132 134
 }
133 135
 
134 136
 // rollback removes the pages from a given pending tx.
... ...
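Illustrative semantics of the `pgids.merge` helper used above and defined at the end of this change (a hypothetical, package-internal example; expected output shown in the comment):

```go
a := pgids{3, 5, 9}
b := pgids{4, 7, 10}
fmt.Println(a.merge(b)) // expected: [3 4 5 7 9 10]
```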
@@ -2,6 +2,7 @@ package bolt
2 2
 
3 3
 import (
4 4
 	"bytes"
5
+	"fmt"
5 6
 	"sort"
6 7
 	"unsafe"
7 8
 )
... ...
@@ -70,7 +71,9 @@ func (n *node) pageElementSize() int {
70 70
 
71 71
 // childAt returns the child node at a given index.
72 72
 func (n *node) childAt(index int) *node {
73
-	_assert(!n.isLeaf, "invalid childAt(%d) on a leaf node", index)
73
+	if n.isLeaf {
74
+		panic(fmt.Sprintf("invalid childAt(%d) on a leaf node", index))
75
+	}
74 76
 	return n.bucket.node(n.inodes[index].pgid, n)
75 77
 }
76 78
 
... ...
@@ -111,9 +114,13 @@ func (n *node) prevSibling() *node {
111 111
 
112 112
 // put inserts a key/value.
113 113
 func (n *node) put(oldKey, newKey, value []byte, pgid pgid, flags uint32) {
114
-	_assert(pgid < n.bucket.tx.meta.pgid, "pgid (%d) above high water mark (%d)", pgid, n.bucket.tx.meta.pgid)
115
-	_assert(len(oldKey) > 0, "put: zero-length old key")
116
-	_assert(len(newKey) > 0, "put: zero-length new key")
114
+	if pgid >= n.bucket.tx.meta.pgid {
115
+		panic(fmt.Sprintf("pgid (%d) above high water mark (%d)", pgid, n.bucket.tx.meta.pgid))
116
+	} else if len(oldKey) <= 0 {
117
+		panic("put: zero-length old key")
118
+	} else if len(newKey) <= 0 {
119
+		panic("put: zero-length new key")
120
+	}
117 121
 
118 122
 	// Find insertion index.
119 123
 	index := sort.Search(len(n.inodes), func(i int) bool { return bytes.Compare(n.inodes[i].key, oldKey) != -1 })
... ...
@@ -189,7 +196,9 @@ func (n *node) write(p *page) {
189 189
 		p.flags |= branchPageFlag
190 190
 	}
191 191
 
192
-	_assert(len(n.inodes) < 0xFFFF, "inode overflow: %d (pgid=%d)", len(n.inodes), p.id)
192
+	if len(n.inodes) >= 0xFFFF {
193
+		panic(fmt.Sprintf("inode overflow: %d (pgid=%d)", len(n.inodes), p.id))
194
+	}
193 195
 	p.count = uint16(len(n.inodes))
194 196
 
195 197
 	// Loop over each item and write it to the page.
... ...
@@ -212,11 +221,20 @@ func (n *node) write(p *page) {
212 212
 			_assert(elem.pgid != p.id, "write: circular dependency occurred")
213 213
 		}
214 214
 
215
+		// If the remaining slice is smaller than key+value then re-base the
216
+		// byte array pointer so it regains a full max-allocation window.
217
+		//
218
+		// See: https://github.com/boltdb/bolt/pull/335
219
+		klen, vlen := len(item.key), len(item.value)
220
+		if len(b) < klen+vlen {
221
+			b = (*[maxAllocSize]byte)(unsafe.Pointer(&b[0]))[:]
222
+		}
223
+
215 224
 		// Write data for the element to the end of the page.
216 225
 		copy(b[0:], item.key)
217
-		b = b[len(item.key):]
226
+		b = b[klen:]
218 227
 		copy(b[0:], item.value)
219
-		b = b[len(item.value):]
228
+		b = b[vlen:]
220 229
 	}
221 230
 
222 231
 	// DEBUG ONLY: n.dump()
... ...
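The write buffer `b` is a slice over a fixed `[maxAllocSize]byte` array pointer, so after enough key/value bytes have been written the remaining window can be shorter than the next item, and the old `b = b[len(item.key):]` would panic with a slice bounds error. Re-basing the array pointer at the current position restores a full-size window; the page element accessors below use the same trick. A self-contained sketch with a toy window size (`windowSize`, the sample data, and `main` are illustrative, not bolt code):

```go
package main

import (
	"fmt"
	"unsafe"
)

// windowSize stands in for bolt's maxAllocSize, shrunk so the
// re-slice path triggers after a few bytes instead of many megabytes.
const windowSize = 8

func main() {
	backing := make([]byte, 32)
	b := (*[windowSize]byte)(unsafe.Pointer(&backing[0]))[:]

	for _, item := range [][]byte{[]byte("abcde"), []byte("fghij")} {
		// Re-base the array pointer whenever the remaining window is
		// too small for the next item, as node.write does for key+value.
		if len(b) < len(item) {
			b = (*[windowSize]byte)(unsafe.Pointer(&b[0]))[:]
		}
		copy(b[0:], item)
		b = b[len(item):]
	}

	fmt.Printf("%s\n", backing[:10]) // abcdefghij
}
```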
@@ -348,7 +366,9 @@ func (n *node) spill() error {
348 348
 		}
349 349
 
350 350
 		// Write the node.
351
-		_assert(p.id < tx.meta.pgid, "pgid (%d) above high water mark (%d)", p.id, tx.meta.pgid)
351
+		if p.id >= tx.meta.pgid {
352
+			panic(fmt.Sprintf("pgid (%d) above high water mark (%d)", p.id, tx.meta.pgid))
353
+		}
352 354
 		node.pgid = p.id
353 355
 		node.write(p)
354 356
 		node.spilled = true
... ...
@@ -3,12 +3,12 @@ package bolt
3 3
 import (
4 4
 	"fmt"
5 5
 	"os"
6
+	"sort"
6 7
 	"unsafe"
7 8
 )
8 9
 
9 10
 const pageHeaderSize = int(unsafe.Offsetof(((*page)(nil)).ptr))
10 11
 
11
-const maxAllocSize = 0xFFFFFFF
12 12
 const minKeysPerPage = 2
13 13
 
14 14
 const branchPageElementSize = int(unsafe.Sizeof(branchPageElement{}))
... ...
@@ -97,7 +97,7 @@ type branchPageElement struct {
97 97
 // key returns a byte slice of the node key.
98 98
 func (n *branchPageElement) key() []byte {
99 99
 	buf := (*[maxAllocSize]byte)(unsafe.Pointer(n))
100
-	return buf[n.pos : n.pos+n.ksize]
100
+	return (*[maxAllocSize]byte)(unsafe.Pointer(&buf[n.pos]))[:n.ksize]
101 101
 }
102 102
 
103 103
 // leafPageElement represents a node on a leaf page.
... ...
@@ -111,13 +111,13 @@ type leafPageElement struct {
111 111
 // key returns a byte slice of the node key.
112 112
 func (n *leafPageElement) key() []byte {
113 113
 	buf := (*[maxAllocSize]byte)(unsafe.Pointer(n))
114
-	return buf[n.pos : n.pos+n.ksize]
114
+	return (*[maxAllocSize]byte)(unsafe.Pointer(&buf[n.pos]))[:n.ksize]
115 115
 }
116 116
 
117 117
 // value returns a byte slice of the node value.
118 118
 func (n *leafPageElement) value() []byte {
119 119
 	buf := (*[maxAllocSize]byte)(unsafe.Pointer(n))
120
-	return buf[n.pos+n.ksize : n.pos+n.ksize+n.vsize]
120
+	return (*[maxAllocSize]byte)(unsafe.Pointer(&buf[n.pos+n.ksize]))[:n.vsize]
121 121
 }
122 122
 
123 123
 // PageInfo represents human readable information about a page.
... ...
@@ -133,3 +133,40 @@ type pgids []pgid
133 133
 func (s pgids) Len() int           { return len(s) }
134 134
 func (s pgids) Swap(i, j int)      { s[i], s[j] = s[j], s[i] }
135 135
 func (s pgids) Less(i, j int) bool { return s[i] < s[j] }
136
+
137
+// merge returns the sorted union of a and b.
138
+func (a pgids) merge(b pgids) pgids {
139
+	// Return the other slice if one is empty.
140
+	if len(a) == 0 {
141
+		return b
142
+	} else if len(b) == 0 {
143
+		return a
144
+	}
145
+
146
+	// Create a list to hold all elements from both lists.
147
+	merged := make(pgids, 0, len(a)+len(b))
148
+
149
+	// Assign lead to the slice with the lower starting value, follow to the other.
150
+	lead, follow := a, b
151
+	if b[0] < a[0] {
152
+		lead, follow = b, a
153
+	}
154
+
155
+	// Continue while there are elements in the lead.
156
+	for len(lead) > 0 {
157
+		// Append the longest prefix of lead whose values do not exceed follow[0].
158
+		n := sort.Search(len(lead), func(i int) bool { return lead[i] > follow[0] })
159
+		merged = append(merged, lead[:n]...)
160
+		if n >= len(lead) {
161
+			break
162
+		}
163
+
164
+		// Swap lead and follow.
165
+		lead, follow = follow, lead[n:]
166
+	}
167
+
168
+	// Append what's left in follow.
169
+	merged = append(merged, follow...)
170
+
171
+	return merged
172
+}
... ...
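With `merge` in place, `release()` above only has to sort the newly pending ids and merge them into the already-sorted `f.ids`, an O(n+m) pass instead of re-sorting the whole concatenation. The `sort.Search` form copies entire runs of consecutive ids at once; a plain two-pointer merge computes the same sorted union and makes the semantics easy to verify (the names below are illustrative):

```go
package main

import "fmt"

type pgid uint64
type pgids []pgid

// mergeSimple computes the same sorted union as pgids.merge, using a
// two-pointer walk instead of run-copying via sort.Search.
func mergeSimple(a, b pgids) pgids {
	merged := make(pgids, 0, len(a)+len(b))
	for len(a) > 0 && len(b) > 0 {
		if a[0] <= b[0] {
			merged, a = append(merged, a[0]), a[1:]
		} else {
			merged, b = append(merged, b[0]), b[1:]
		}
	}
	return append(append(merged, a...), b...)
}

func main() {
	free := pgids{4, 9, 12}      // already-free pages, sorted
	pending := pgids{3, 5, 6, 7} // pages released by finished txs, sorted
	fmt.Println(mergeSimple(free, pending)) // [3 4 5 6 7 9 12]
}
```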
@@ -87,18 +87,21 @@ func (tx *Tx) Stats() TxStats {
87 87
 
88 88
 // Bucket retrieves a bucket by name.
89 89
 // Returns nil if the bucket does not exist.
90
+// The bucket instance is only valid for the lifetime of the transaction.
90 91
 func (tx *Tx) Bucket(name []byte) *Bucket {
91 92
 	return tx.root.Bucket(name)
92 93
 }
93 94
 
94 95
 // CreateBucket creates a new bucket.
95 96
 // Returns an error if the bucket already exists, if the bucket name is blank, or if the bucket name is too long.
97
+// The bucket instance is only valid for the lifetime of the transaction.
96 98
 func (tx *Tx) CreateBucket(name []byte) (*Bucket, error) {
97 99
 	return tx.root.CreateBucket(name)
98 100
 }
99 101
 
100 102
 // CreateBucketIfNotExists creates a new bucket if it doesn't already exist.
101 103
 // Returns an error if the bucket name is blank, or if the bucket name is too long.
104
+// The bucket instance is only valid for the lifetime of the transaction.
102 105
 func (tx *Tx) CreateBucketIfNotExists(name []byte) (*Bucket, error) {
103 106
 	return tx.root.CreateBucketIfNotExists(name)
104 107
 }
... ...
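The new lifetime notes matter because a `Bucket`, and every `[]byte` it hands out, points into the transaction's mmap. A hedged sketch of the safe pattern (bucket and key names are illustrative):

```go
var name []byte
err := db.View(func(tx *bolt.Tx) error {
	b := tx.Bucket([]byte("widgets"))
	if b == nil {
		return nil
	}
	// Copy the value out: both the bucket and the slice returned by
	// Get() become invalid once the transaction closes.
	name = append([]byte(nil), b.Get([]byte("name"))...)
	return nil
})
```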
@@ -127,7 +130,8 @@ func (tx *Tx) OnCommit(fn func()) {
127 127
 }
128 128
 
129 129
 // Commit writes all changes to disk and updates the meta page.
130
-// Returns an error if a disk write error occurs.
130
+// Returns an error if a disk write error occurs, or if Commit is
131
+// called on a read-only transaction.
131 132
 func (tx *Tx) Commit() error {
132 133
 	_assert(!tx.managed, "managed tx commit not allowed")
133 134
 	if tx.db == nil {
... ...
@@ -203,7 +207,8 @@ func (tx *Tx) Commit() error {
203 203
 	return nil
204 204
 }
205 205
 
206
-// Rollback closes the transaction and ignores all previous updates.
206
+// Rollback closes the transaction and ignores all previous updates. Read-only
207
+// transactions must be rolled back and not committed.
207 208
 func (tx *Tx) Rollback() error {
208 209
 	_assert(!tx.managed, "managed tx rollback not allowed")
209 210
 	if tx.db == nil {
... ...
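In practice the documented rule looks like this for a manually managed transaction: one opened with `db.Begin(false)` is read-only and must be finished with `Rollback()`, since `Commit()` now documents an error for that case (bucket and key names are illustrative):

```go
tx, err := db.Begin(false)
if err != nil {
	return err
}
// Read-only transactions must end in Rollback(); calling Commit()
// on one returns an error.
defer tx.Rollback()

value := tx.Bucket([]byte("widgets")).Get([]byte("foo"))
```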
@@ -234,7 +239,8 @@ func (tx *Tx) close() {
234 234
 		var freelistPendingN = tx.db.freelist.pending_count()
235 235
 		var freelistAlloc = tx.db.freelist.size()
236 236
 
237
-		// Remove writer lock.
237
+		// Remove transaction ref & writer lock.
238
+		tx.db.rwtx = nil
238 239
 		tx.db.rwlock.Unlock()
239 240
 
240 241
 		// Merge statistics.
... ...
@@ -248,41 +254,51 @@ func (tx *Tx) close() {
248 248
 	} else {
249 249
 		tx.db.removeTx(tx)
250 250
 	}
251
+
252
+	// Clear all references.
251 253
 	tx.db = nil
254
+	tx.meta = nil
255
+	tx.root = Bucket{tx: tx}
256
+	tx.pages = nil
252 257
 }
253 258
 
254 259
 // Copy writes the entire database to a writer.
255
-// A reader transaction is maintained during the copy so it is safe to continue
256
-// using the database while a copy is in progress.
257
-// Copy will write exactly tx.Size() bytes into the writer.
260
+// This function exists for backwards compatibility. Use WriteTo() instead.
258 261
 func (tx *Tx) Copy(w io.Writer) error {
259
-	var f *os.File
260
-	var err error
262
+	_, err := tx.WriteTo(w)
263
+	return err
264
+}
261 265
 
266
+// WriteTo writes the entire database to a writer.
267
+// If err == nil then exactly tx.Size() bytes will be written into the writer.
268
+func (tx *Tx) WriteTo(w io.Writer) (n int64, err error) {
262 269
 	// Attempt to open reader directly.
270
+	var f *os.File
263 271
 	if f, err = os.OpenFile(tx.db.path, os.O_RDONLY|odirect, 0); err != nil {
264 272
 		// Fallback to a regular open if that doesn't work.
265 273
 		if f, err = os.OpenFile(tx.db.path, os.O_RDONLY, 0); err != nil {
266
-			return err
274
+			return 0, err
267 275
 		}
268 276
 	}
269 277
 
270 278
 	// Copy the meta pages.
271 279
 	tx.db.metalock.Lock()
272
-	_, err = io.CopyN(w, f, int64(tx.db.pageSize*2))
280
+	n, err = io.CopyN(w, f, int64(tx.db.pageSize*2))
273 281
 	tx.db.metalock.Unlock()
274 282
 	if err != nil {
275 283
 		_ = f.Close()
276
-		return fmt.Errorf("meta copy: %s", err)
284
+		return n, fmt.Errorf("meta copy: %s", err)
277 285
 	}
278 286
 
279 287
 	// Copy data pages.
280
-	if _, err := io.CopyN(w, f, tx.Size()-int64(tx.db.pageSize*2)); err != nil {
288
+	wn, err := io.CopyN(w, f, tx.Size()-int64(tx.db.pageSize*2))
289
+	n += wn
290
+	if err != nil {
281 291
 		_ = f.Close()
282
-		return err
292
+		return n, err
283 293
 	}
284 294
 
285
-	return f.Close()
295
+	return n, f.Close()
286 296
 }
287 297
 
288 298
 // CopyFile copies the entire database to file at the given path.
... ...
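Because `WriteTo` reports the byte count, callers can now set accurate lengths when streaming a hot backup, e.g. over HTTP from inside a `View` transaction. A sketch (the handler name and the `db` variable are assumptions for illustration):

```go
func BackupHandleFunc(w http.ResponseWriter, req *http.Request) {
	err := db.View(func(tx *bolt.Tx) error {
		w.Header().Set("Content-Type", "application/octet-stream")
		w.Header().Set("Content-Disposition", `attachment; filename="my.db"`)
		w.Header().Set("Content-Length", strconv.FormatInt(tx.Size(), 10))
		_, err := tx.WriteTo(w)
		return err
	})
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
	}
}
```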
@@ -416,15 +432,39 @@ func (tx *Tx) write() error {
416 416
 	// Write pages to disk in order.
417 417
 	for _, p := range pages {
418 418
 		size := (int(p.overflow) + 1) * tx.db.pageSize
419
-		buf := (*[maxAllocSize]byte)(unsafe.Pointer(p))[:size]
420 419
 		offset := int64(p.id) * int64(tx.db.pageSize)
421
-		if _, err := tx.db.ops.writeAt(buf, offset); err != nil {
422
-			return err
423
-		}
424 420
 
425
-		// Update statistics.
426
-		tx.stats.Write++
421
+		// Write out page in "max allocation" sized chunks.
422
+		ptr := (*[maxAllocSize]byte)(unsafe.Pointer(p))
423
+		for {
424
+			// Limit our write to our max allocation size.
425
+			sz := size
426
+			if sz > maxAllocSize-1 {
427
+				sz = maxAllocSize - 1
428
+			}
429
+
430
+			// Write chunk to disk.
431
+			buf := ptr[:sz]
432
+			if _, err := tx.db.ops.writeAt(buf, offset); err != nil {
433
+				return err
434
+			}
435
+
436
+			// Update statistics.
437
+			tx.stats.Write++
438
+
439
+			// Exit inner for loop if we've written all the chunks.
440
+			size -= sz
441
+			if size == 0 {
442
+				break
443
+			}
444
+
445
+			// Otherwise move offset forward and move pointer to next chunk.
446
+			offset += int64(sz)
447
+			ptr = (*[maxAllocSize]byte)(unsafe.Pointer(&ptr[sz]))
448
+		}
427 449
 	}
450
+
451
+	// Ignore file sync if flag is set on DB.
428 452
 	if !tx.db.NoSync || IgnoreNoSync {
429 453
 		if err := fdatasync(tx.db); err != nil {
430 454
 			return err