Go基本库阅读:bufio库
目录
bufio
这个库是io库的实现,如果需要自行实现io库可以看一下这个库中Read和Write相关函数的实现。
本质上相当于io加上缓冲区。
参考:
- 1.4 bufio库
- 常见的表述,包括函数功能解释参考此文,本文不再赘述
- 源代码
- 代码
Reader
bufio的Reader本质上是io.Reader的wrap,只是多了一个缓冲区并进行了一些对应的实现
type Reader struct {
buf []byte
rd io.Reader // reader provided by the client
r, w int // buf read and write positions 其中r代表着缓冲区的开头,w代表着缓冲区的末尾
err error
lastByte int // last byte read for UnreadByte; -1 means invalid
lastRuneSize int // size of last rune read for UnreadRune; -1 means invalid
}
初始化函数,下列函数可以初始化并得到一个新的Reader
- NewReaderSize:根据给定的ReaderSize申请缓冲区
- NewReader:根据默认的ReaderSize申请缓冲区
- Reset:重置各项参数,如果本身是nil则用默认值申请缓冲区
常用到的工具函数:
- fill,从rd中读取chunk,填充缓冲区。总是会尝试写满缓冲区,并且每次调用前会先整理缓冲区。
读取系列
- Read,如果缓冲区无数据则直接调用io.Reader.Read;如果缓冲区有数据,则将缓冲区数据写入到数组中。
- Peek,读取n个字节而不让缓冲区前进(表现在代码中则是b.r不变)。会反复调用fill直到缓冲区满或者缓冲区中的有效元素达到指定的大小,很有bufio特色的函数,只有在有缓冲的情况下才能实现的功能。
- ReadSlice,读取到delimiter为止
- ReadByte,读取下一个字节的数据
- ReadLine,本质上是ReadSlice(’\n’)
ReadBytes和ReadSlice的区别:
- ReadSlice返回的对象是Reader中的buffer,因此可能会在下一次读的时候失效;在代码中可以比较清楚的看到返回值line直接就是b.buf。
- ReadBytes则进行了复制,对返回得到的full和frag进行了对应的复制,拷贝到新的数组上
- 但是ReadBytes本质上是调用了ReadSlice进行的,核心代码如下:
// collectFragments reads until the first occurrence of delim in the input. It
// returns (slice of full buffers, remaining bytes before delim, total number
// of bytes in the combined first two elements, error).
// The complete result is equal to
// `bytes.Join(append(fullBuffers, finalFragment), nil)`, which has a
// length of `totalLen`. The result is structured in this way to allow callers
// to minimize allocations and copies.
func (b *Reader) collectFragments(delim byte) (fullBuffers [][]byte, finalFragment []byte, totalLen int, err error) {
var frag []byte
// Use ReadSlice to look for delim, accumulating full buffers.
for {
var e error
frag, e = b.ReadSlice(delim)
if e == nil { // got final fragment
break
}
if e != ErrBufferFull { // unexpected error
err = e
break
}
// Make a copy of the buffer.
buf := make([]byte, len(frag))
copy(buf, frag)
fullBuffers = append(fullBuffers, buf)
totalLen += len(buf)
}
totalLen += len(frag)
return fullBuffers, frag, totalLen, err
}
Writer
类似,费事写了
Scanner
Scanner专门使用了一个文件src/bufio/scan.go来实现,结构体定义如下。简单来说,Scan是为了分词功能而设计的,可以将流式的输入切分成多个token。对于非流式的输入,使用strings或bytes中的Split会更简单一些。
// Scanner provides a convenient interface for reading data such as
// a file of newline-delimited lines of text. Successive calls to
// the Scan method will step through the 'tokens' of a file, skipping
// the bytes between the tokens. The specification of a token is
// defined by a split function of type SplitFunc; the default split
// function breaks the input into lines with line termination stripped. Split
// functions are defined in this package for scanning a file into
// lines, bytes, UTF-8-encoded runes, and space-delimited words. The
// client may instead provide a custom split function.
//
// Scanning stops unrecoverably at EOF, the first I/O error, or a token too
// large to fit in the buffer. When a scan stops, the reader may have
// advanced arbitrarily far past the last token. Programs that need more
// control over error handling or large tokens, or must run sequential scans
// on a reader, should use bufio.Reader instead.
//
type Scanner struct {
r io.Reader // The reader provided by the client.
split SplitFunc // The function to split the tokens.
maxTokenSize int // Maximum size of a token; modified by tests.
token []byte // Last token returned by split. 得到的token,Text或Bytes方法实际上返回的就是这个值
buf []byte // Buffer used as argument to split.
start int // First non-processed byte in buf.
end int // End of data in buf.
err error // Sticky error.
empties int // Count of successive empty tokens.
scanCalled bool // Scan has been called; buffer is in use.
done bool // Scan has finished.
}
其中比较关键的几个成员是split、maxTokenSize、token、buf等
// SplitFunc is the signature of the split function used to tokenize the
// input. The arguments are an initial substring of the remaining unprocessed
// data and a flag, atEOF, that reports whether the Reader has no more data
// to give. The return values are the number of bytes to advance the input
// and the next token to return to the user, if any, plus an error, if any.
//
// Scanning stops if the function returns an error, in which case some of
// the input may be discarded. If that error is ErrFinalToken, scanning
// stops with no error.
//
// Otherwise, the Scanner advances the input. If the token is not nil,
// the Scanner returns it to the user. If the token is nil, the
// Scanner reads more data and continues scanning; if there is no more
// data--if atEOF was true--the Scanner returns. If the data does not
// yet hold a complete token, for instance if it has no newline while
// scanning lines, a SplitFunc can return (0, nil, nil) to signal the
// Scanner to read more data into the slice and try again with a
// longer slice starting at the same point in the input.
//
// The function is never called with an empty data slice unless atEOF
// is true. If atEOF is true, however, data may be non-empty and,
// as always, holds unprocessed text.
type SplitFunc func(data []byte, atEOF bool) (advance int, token []byte, err error)
// 翻译一下,data是需要分词的数据源,atEOF表明后面是否还有数据(当前是否为EOF)
// 输出的advance为读取的数据总量,token为得到的分词,err为错误系信息
// 默认使用的SplitFunc是SplitLines
// ScanLines is a split function for a Scanner that returns each line of
// text, stripped of any trailing end-of-line marker. The returned line may
// be empty. The end-of-line marker is one optional carriage return followed
// by one mandatory newline. In regular expression notation, it is `\r?\n`.
// The last non-empty line of input will be returned even if it has no
// newline.
func ScanLines(data []byte, atEOF bool) (advance int, token []byte, err error) {
// 如果达到了EOF而且没有需要读写的数据,则直接返回0
if atEOF && len(data) == 0 {
return 0, nil, nil
}
if i := bytes.IndexByte(data, '\n'); i >= 0 {
// We have a full newline-terminated line.
return i + 1, dropCR(data[0:i]), nil
}
// If we're at EOF, we have a final, non-terminated line. Return it.
if atEOF {
return len(data), dropCR(data), nil
}
// Request more data.
return 0, nil, nil
}
// dropCR drops a terminal \r from the data.
func dropCR(data []byte) []byte {
if len(data) > 0 && data[len(data)-1] == '\r' {
return data[0 : len(data)-1]
}
return data
}
使用
简单的例子:
package main
import (
"bufio"
"fmt"
"os"
"strings"
)
func main() {
const input = "This is The Golang Standard Library.\nWelcome you!"
scanner := bufio.NewScanner(strings.NewReader(input))
scanner.Split(bufio.ScanWords)
count := 0
for scanner.Scan() {
fmt.Println(scanner.Text())
count++
}
if err := scanner.Err(); err != nil {
fmt.Fprintln(os.Stderr, "reading input:", err)
}
fmt.Println(count)
}
- 使用NewScanner等方法可以初始化Scanner
- 使用Split等方法可以改变其属性
- scanner.Scan如同Next一样,如果有数据(token或残留数据)则返回true,并且结果可以通过scanner.Text()或者scanner.Bytes()等拿到
实现细节
Scan函数是核心
// src/bufio/scan.go:136
// 返回是否完成,false为已经结束,true为未结束
// Scan advances the Scanner to the next token, which will then be
// available through the Bytes or Text method. It returns false when the
// scan stops, either by reaching the end of the input or an error.
// After Scan returns false, the Err method will return any error that
// occurred during scanning, except that if it was io.EOF, Err
// will return nil.
// Scan panics if the split function returns too many empty
// tokens without advancing the input. This is a common error mode for
// scanners.
func (s *Scanner) Scan() bool {
if s.done { // 如果已经完成,则返回false
return false
}
s.scanCalled = true
// Loop until we have a token.
for {
// See if we can get a token with what we already have.
// If we've run out of data but have an error, give the split function
// a chance to recover any remaining, possibly empty token.
if s.end > s.start || s.err != nil {
advance, token, err := s.split(s.buf[s.start:s.end], s.err != nil)//如果s.err!=nil时为EOF
if err != nil {
if err == ErrFinalToken {
s.token = token
s.done = true
return true
}
s.setErr(err)
return false
}
if !s.advance(advance) {
return false
}
s.token = token // 注册得到的token
if token != nil {
if s.err == nil || advance > 0 {
s.empties = 0
} else { // token !=nil, s.err!=nil && advance<=0 这个条件说明splitFunc实现的有问题
// Returning tokens not advancing input at EOF.
s.empties++
if s.empties > maxConsecutiveEmptyReads {
panic("bufio.Scan: too many empty tokens without progressing")
}
}
return true
}
}
// 后面的全部为s.end == s.start && s.err == nil,或者s.token==nil
// We cannot generate a token with what we are holding.
// If we've already hit EOF or an I/O error, we are done.
if s.err != nil {
// Shut it down.
s.start = 0
s.end = 0
return false
}
// Must read more data.
// 尝试读取更多的数据来继续Scan
// First, shift data to beginning of buffer if there's lots of empty space
// or space is needed.
if s.start > 0 && (s.end == len(s.buf) || s.start > len(s.buf)/2) {
copy(s.buf, s.buf[s.start:s.end])
s.end -= s.start
s.start = 0
}
// Is the buffer full? If so, resize.
if s.end == len(s.buf) {
// Guarantee no overflow in the multiplication below.
const maxInt = int(^uint(0) >> 1)
if len(s.buf) >= s.maxTokenSize || len(s.buf) > maxInt/2 {
s.setErr(ErrTooLong)
return false
}
newSize := len(s.buf) * 2
if newSize == 0 {
newSize = startBufSize
}
if newSize > s.maxTokenSize {
newSize = s.maxTokenSize
}
newBuf := make([]byte, newSize)
copy(newBuf, s.buf[s.start:s.end])
s.buf = newBuf
s.end -= s.start
s.start = 0
}
// Finally we can read some input. Make sure we don't get stuck with
// a misbehaving Reader. Officially we don't need to do this, but let's
// be extra careful: Scanner is for safe, simple jobs.
for loop := 0; ; {
n, err := s.r.Read(s.buf[s.end:len(s.buf)])
if n < 0 || len(s.buf)-s.end < n {
s.setErr(ErrBadReadCount)
break
}
s.end += n
if err != nil {
s.setErr(err)
break
}
if n > 0 {
s.empties = 0
break
}
loop++
if loop > maxConsecutiveEmptyReads {
s.setErr(io.ErrNoProgress)
break
}
}
}
}
// advance consumes n bytes of the buffer. It reports whether the advance was legal.
// 检查的同时会修改s.start
func (s *Scanner) advance(n int) bool {
if n < 0 {
s.setErr(ErrNegativeAdvance)
return false
}
if n > s.end-s.start {
s.setErr(ErrAdvanceTooFar)
return false
}
s.start += n
return true
}