Which of the following could be used as a delimiter in a CSV (Comma Separated Values) file?

CSV (comma-separated values) files are used to store tabular data (numbers and text) in plain text. "Plain text" means that the file is a pure sequence of characters, without any hidden formatting information that the computer has to process.

A CSV file stores data as "records" separated by line breaks (each line of the file is one data "record"); the records carry no explicit record number. Each record has one or more "fields" separated by a delimiter, most commonly a comma (","), a semicolon (";") or the "invisible" character that appears when you press the "tab" key. Files separated by commas and semicolons usually receive the "CSV" extension and files separated by a "tab" the "TSV" extension. Delimited files of this kind are also sometimes saved with the "TXT" extension. CSV files are simple and work in most applications that deal with structured data.

To compare with the rows and columns of a spreadsheet, the "records" in a CSV file are the rows and the "fields" are the columns. The first "record," which is the first line, usually contains the column names for each of the "fields." Although no binding international standard exists for CSV (RFC 4180 documents a common format, but it is only informational), the variations are simple enough that compatible applications can usually handle the differences. Typically, this is how a CSV file is displayed when opened in a text editor:

Continente;País;Capital
África;Angola;Luanda
América do Norte;Estados Unidos;Washington DC
América Central;México;Cidade do México
América do Sul;Brasil;Brasília
Europa;Espanha;Madri
Europa;Alemanha;Berlim
Oceania;Austrália;Camberra
Ásia;Japão;Tóquio

This file contains three columns separated by the semicolon (";") delimiter: Continent, Country and Capital, as described in the first line. After the header there are eight records in all. The first triad is Africa-Angola-Luanda and the last is Asia-Japan-Tokyo. The CSV format itself places no limit on the number of lines or columns. A file can hold millions or tens of millions of records, limited mainly by the memory and processing power of the computer used to query it. If the same CSV file were opened in a spreadsheet program, it would be displayed like this:

Continente        País            Capital
África            Angola          Luanda
América do Norte  Estados Unidos  Washington DC
América Central   México          Cidade do México
América do Sul    Brasil          Brasília
Europa            Espanha         Madri
Europa            Alemanha        Berlim
Oceania           Austrália       Camberra
Ásia              Japão           Tóquio
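
The same semicolon-delimited text can also be split into records and fields programmatically. The following is a minimal sketch using Python's standard csv module; the in-memory string and the variable names are illustrative assumptions, and only two of the eight records are reproduced.

```python
import csv
import io

# Illustrative in-memory copy of a few lines of the sample file above.
sample = (
    "Continente;País;Capital\n"
    "África;Angola;Luanda\n"
    "Ásia;Japão;Tóquio\n"
)

# delimiter=";" tells the reader that fields are split on semicolons.
reader = csv.reader(io.StringIO(sample), delimiter=";")
header = next(reader)   # the first record holds the column names
records = list(reader)  # the remaining records are the data rows

print(header)
print(records[0])
```

Each record comes back as a list of field strings, which mirrors the row-and-column view a spreadsheet would show.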

While it may seem annoying that a CSV can in fact be separated by something other than a comma, it is actually very convenient. Say, for example, your data contains commas in places where they are not meant to act as separators (addresses, for example, or currency amounts). This can confuse the computer program trying to read your data, which is why quotation marks are used as text qualifiers. Instead of using commas both within your text fields and as a delimiter, you can opt for another delimiter, one that is not used elsewhere within your data.
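
Both approaches can be sketched with Python's standard csv module. The record below is hypothetical; the point is only that a quoted field and a different delimiter lead to the same parsed result.

```python
import csv
import io

# Hypothetical record whose address field contains commas.
# Option 1: keep the comma delimiter and wrap the field in quotation marks.
quoted = 'name,address\nAna,"Rua A, 10, Lisboa"\n'
quoted_fields = list(csv.reader(io.StringIO(quoted)))[1]

# Option 2: switch to a delimiter that never occurs in the data.
alt = "name;address\nAna;Rua A, 10, Lisboa\n"
alt_fields = list(csv.reader(io.StringIO(alt), delimiter=";"))[1]

print(quoted_fields)
print(alt_fields)
```

In both cases the address survives as a single field rather than being split at its internal commas.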

You can see this more practically when looking at the delimiter defaults by locale. For example, in much of Europe the default delimiter is more commonly a semicolon, because the comma is used as the decimal separator (in currency amounts, for example).
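
As a rough illustration of such a locale-style file, the sketch below reads semicolon-delimited data with decimal commas using pandas. The sample data is made up; sep and decimal are existing read_csv parameters.

```python
import io
import pandas as pd

# Hypothetical European-style data: semicolon delimiter, comma as decimal mark.
text = "item;price\ncoffee;2,50\ntea;1,75\n"

# sep=";" splits the fields; decimal="," makes "2,50" parse as the number 2.5.
df = pd.read_csv(io.StringIO(text), sep=";", decimal=",")
total = df["price"].sum()

print(df)
print(total)
```

Without decimal=",", the price column would most likely be read as text rather than as numbers.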

To bring us back to the core concept:

The delimiter in your CSV is the character (comma or otherwise) that separates the data in your file into distinct fields. Practically speaking, it is what allows you to open your file in a spreadsheet and view the data in nicely organized columns and rows, and even what allows programs to import data from your CSV and place the data into the correct fields in your database. Delimiters are the key to the entire structure of CSV!

This blog was published as a part of Data Science Blogathon 7

Every data analysis project requires a dataset. These datasets are available in various file formats such as .xlsx, .json, .csv and .html. Conventionally, datasets are mostly found in .csv format. CSV (or Comma Separated Values) files, as the name suggests, have data items separated by commas. CSV files are plain text files, which keeps them small. CSV files can also be viewed and saved in tabular form in popular tools such as Microsoft Excel and Google Sheets.

The commas used in CSV files are known as delimiters. Think of a delimiter as a boundary that separates any two adjacent data items.

Reading CSV Files using Pandas

To read these CSV files, we use a function of the Pandas library called read_csv().

import pandas as pd
df = pd.read_csv(filepath_or_buffer)  # path (or file-like object) of the CSV file to read

The read_csv() function has dozens of parameters, of which one is mandatory and the others are optional, to be used as needed. The mandatory parameter specifies the CSV file we want to read.

Note: Remember to use double backslashes (or a raw string) when specifying a Windows file path, because a single backslash starts an escape sequence in a Python string.

[Figure: abc.csv file. Source: Personal Computer]

The sep Parameter 

One of the optional parameters in read_csv() is sep, a shortened name for separator. This parameter specifies the delimiter we talked about before. It tells pandas which delimiter is used in our dataset or, in layman's terms, how the data items are separated in our CSV file.

The default value of the sep parameter is the comma (,), which means that if we don't specify the sep parameter in our read_csv() call, the file is assumed to use the comma as its delimiter. Thus, whenever a call omits the sep parameter, pandas treats the file as comma-delimited.
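
The default can be checked with a small in-memory example. This is only a sketch: the sample data and variable names are assumptions, and io.StringIO stands in for a file on disk.

```python
import io
import pandas as pd

# Hypothetical comma-delimited data.
text = "name,score\nAsha,90\nRavi,85\n"

df_default = pd.read_csv(io.StringIO(text))            # sep omitted
df_explicit = pd.read_csv(io.StringIO(text), sep=",")  # sep stated explicitly
same = df_default.equals(df_explicit)

# A semicolon-delimited file read with the default sep is not split at all.
semi = "name;score\nAsha;90\n"
wrong = pd.read_csv(io.StringIO(semi))

print(same)
print(wrong.shape)
```

The second read shows the typical symptom of a wrong delimiter: everything lands in a single column.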

Using Other Delimiters

Often, the data items in a .csv file are separated by a delimiter other than a comma, such as a semicolon, colon, tab or vertical bar. In such cases, we need to use the sep parameter inside the read_csv() function. For example, a file named Example.csv is a semicolon-separated CSV file.

[Figure: Example.csv file. Source: Personal Computer]

df = pd.read_csv("C:\\Users\\Rahul\\Desktop\\Example.csv", sep=';')

On executing this code, we get a dataframe named df:

[Figure: Dataframe df. Source: Personal Computer]

Vertical-bar Separator

Similarly, a vertical-bar-delimited file can be read by:

df = pd.read_csv("C:\\Users\\Rahul\\Desktop\\Example.csv", sep='|')

Colon Separator

And a colon-delimited file can be read by:

df = pd.read_csv("C:\\Users\\Rahul\\Desktop\\Example.csv", sep=':')

Tab Separator

We may often come across datasets with the file format .tsv. These .tsv files contain tab-separated values; in other words, they use the tab character as the delimiter. Such files can be read with the same read_csv() function of pandas, as long as we specify the delimiter. For example:

df = pd.read_csv("C:\\Users\\Rahul\\Desktop\\Example.tsv", sep='\t')

Similarly, other separators can be used, based on the delimiter identified in our data.
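
The calls above depend on files at a specific Windows path. As a self-contained sketch, the same idea can be tried on in-memory text; the rows and names below are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical rows, written out once per delimiter and read back with a matching sep.
rows = [["name", "city"], ["Asha", "Pune"], ["Ravi", "Delhi"]]

frames = {}
for sep in [";", "|", ":", "\t"]:
    text = "\n".join(sep.join(row) for row in rows)
    frames[sep] = pd.read_csv(io.StringIO(text), sep=sep)

# Every delimiter yields the same dataframe once sep matches the data.
all_equal = all(frame.equals(frames[";"]) for frame in frames.values())
print(all_equal)
print(frames["\t"])
```

The delimiter changes only how the text is split, not the resulting table.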

 

Conclusion

It is always useful to check how the data is stored in our dataset. Understanding the data is necessary before starting to work with it. A delimiter can usually be identified simply by inspecting the data. Based on that inspection, we can pass the relevant delimiter to the sep parameter.
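
When inspecting by eye is impractical, the check can be approximated in code. The sketch below uses csv.Sniffer from Python's standard library on a made-up sample; the candidate list passed as delimiters is an assumption, and the guess can be wrong on unusual data, so it is worth verifying the result.

```python
import csv

# Hypothetical first lines of a file whose delimiter is not known in advance.
sample = "id|name|city\n1|Ana|Lisboa\n2|Ravi|Pune\n"

# Restrict the guess to a few likely candidates.
dialect = csv.Sniffer().sniff(sample, delimiters=",;\t|")
detected = dialect.delimiter

print(repr(detected))
```

The detected character could then be passed as the sep argument of read_csv().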

The media shown in this article are not owned by Analytics Vidhya and are used at the Author's discretion.

What is the delimiter used in CSV files?

When the field separator (delimiter) is a comma, the file is in comma-separated (CSV) or comma-delimited format. Another popular delimiter is the tab. If a field contains the delimiter character within its text, the program interprets this as the end of the field rather than as part of the text, unless the field is enclosed in quotation marks.

How do I separate Comma Separated Values in CSV?

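Indicate the separator directly in the CSV file. To do this, open your file in any text editor, say Notepad, and type one of the strings below on the first line, before any other data:
To separate values with a comma: sep=,
To separate values with a semicolon: sep=;
Note that this sep= line is a hint understood by Microsoft Excel; other programs may read it as an ordinary line of data.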
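
A program that has to read such files can strip the hint line itself. The helper below is a hypothetical sketch, not part of any library: it checks for a leading sep= line and uses the character after it as the delimiter for Python's standard csv module.

```python
import csv
import io

def read_with_sep_hint(text, default=","):
    """Parse CSV text, honoring an optional leading 'sep=X' line (hypothetical helper)."""
    first, _, rest = text.partition("\n")
    first = first.rstrip("\r")  # tolerate Windows line endings
    if first.startswith("sep=") and len(first) == 5:
        delimiter, body = first[4], rest   # hint found: drop the line, use its character
    else:
        delimiter, body = default, text    # no hint: parse everything with the default
    return list(csv.reader(io.StringIO(body), delimiter=delimiter))

rows = read_with_sep_hint("sep=;\nname;city\nAna;Lisboa\n")
plain = read_with_sep_hint("a,b\n1,2\n")
print(rows)
print(plain)
```

Text without a hint line falls back to the comma, matching the usual CSV default.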
