Toolbox for MEDLINE files

This document explains how to install and use the tools PMComp (PubMed File Compressor) and PMUniq (PubMed File Unique). PMComp compresses MEDLINE/PubMed data in PubMed format, and PMUniq removes duplicate articles from such data.

  1. Compression with PMComp
  2. Doublet Removal with PMUniq

Compression with PMComp

Description

Meva refuses to process input data that exceed a size limit (the administrator can change this limit). In this case you can compress the data locally on your computer with PMComp (PubMed File Compressor) before sending them to Meva. The compression usually reduces your data to 1-20 % of the original file size. For instance, if Meva is restricted to accept no more than 5 MB from the user, you can still send the equivalent of 25-500 MB of original data. In general, if you have large data to analyze, or if you consult Meva repeatedly with the same data but different parameters, using PMComp is recommended, since it decreases the network traffic and increases the processing speed of client and server.
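The arithmetic behind the 25-500 MB figure can be sketched in a few lines of Python. The function name and the 5 MB limit are illustrative only; the actual limit is whatever the Meva administrator has configured.

```python
def effective_input_mb(limit_mb: float, compressed_fraction: float) -> float:
    """Original file size (in MB) that still fits under the upload limit
    when compression shrinks a file to `compressed_fraction` of its size."""
    if not 0 < compressed_fraction <= 1:
        raise ValueError("compressed_fraction must be in (0, 1]")
    return limit_mb / compressed_fraction


limit = 5.0  # hypothetical upload limit in MB
for fraction in (0.20, 0.01):  # the 1-20 % range quoted above
    size = effective_input_mb(limit, fraction)
    print(f"compressed to {fraction:.0%}: up to about {size:.0f} MB of original data")
```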

Apart from the restricted input size, Meva also limits its output to protect web clients from having to handle megabytes of data. The output limitations apply only to HTML mode, not to text mode results.

Installation

Download the archive (Windows x64: pmcomp.zip, 9 KB, MD5; Linux ELF x64: pmcomp.tgz, 5 KB, MD5) onto your computer. No special setup procedure is needed. Unpack the archive and put the executable into a directory of your choice, e.g. the directory into which you download your PubMed result files.
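If you want to compare a downloaded archive with its published MD5 checksum, a small Python helper is enough. This is only a sketch: the function names are made up for illustration, and the expected checksum must be taken from the download page.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 65536) -> str:
    """Return the hexadecimal MD5 digest of the file at `path`."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so that large files need not fit into memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: str, expected_md5: str) -> bool:
    """Compare the file's digest with a published checksum (case-insensitive)."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

Usage would look like matches_checksum("pmcomp.zip", published_md5), where published_md5 is the string shown behind the MD5 link.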

If your PubMed files always have a certain extension, you can simplify later handling: double click such a file, choose “Other …” in the “Open with” dialogue and select the path to PMComp. From then on, a simple double click on such a file starts PMComp for it.

Usage

PMComp compresses PubMed files by extracting only those fields the user wishes to analyze (and thereby dropping the others). So you must tell PMComp which input file to read, which fields to extract and which output file to write these fields to. This is done by passing 4 parameters:

  1. Input file name, e.g. pubmed_result.txt. This is the PubMed result file you have downloaded. Make sure this file is in MEDLINE text format.
  2. Short name of field 1, e.g. AD for Affiliation. If you leave it empty, PMComp assumes that you want to remove only the abstracts, i.e. all other fields are extracted and no other information is lost. (Meva cannot analyze abstracts anyway, but they take up much space in the file.) Despite the compression, Meva can then still evaluate all other fields.
  3. Short name of field 2, e.g. SO for Source. The default is NONE, i.e. PMComp will not extract a 2nd field.
  4. Output file name, e.g. pubmed_result-ad-so.meva. PMComp suggests a name consisting of the input file name extended by the short names of the fields to be extracted.

You can find the short names of the fields in the Field Help.
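To illustrate the idea of field extraction, here is a minimal Python sketch. It is not PMComp's implementation and does not produce the .meva format, which is not described here. It assumes the usual MEDLINE text layout: a tag of up to four characters followed by a hyphen in column 5, continuation lines that start with spaces, and blank lines between records. The function name filter_fields is hypothetical.

```python
from typing import Iterable, Iterator, Set


def filter_fields(lines: Iterable[str], keep: Set[str]) -> Iterator[str]:
    """Yield only the lines that belong to the tags in `keep`.

    An empty `keep` set mimics the default described above:
    every field except the abstract (AB) is kept.
    """
    keeping = False
    for line in lines:
        text = line.rstrip("\n")
        if not text.strip():
            keeping = False
            yield line  # preserve the blank line that separates records
            continue
        # Assumed layout of a tag line: "AU  - Smith J" (hyphen in column 5).
        if len(text) > 4 and text[4] == "-" and not text.startswith(" "):
            tag = text[:4].strip()
            keeping = (tag in keep) if keep else (tag != "AB")
        # Continuation lines inherit the decision made for their tag line.
        if keeping:
            yield line
```

Calling filter_fields(open("ms2000.txt"), {"MH", "AU"}) would then yield only the MeSH term and author lines, whereas an empty set drops only the abstracts.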

Interactive Usage

Drag and drop the file to be compressed onto the PMComp executable (or simply double click the file, if you have customized your File Explorer that way - see above). PMComp will start and ask for its remaining parameters: either press the ENTER key to accept the defaults or enter your own values.

Three examples follow. You have searched in PubMed for articles related to Multiple Sclerosis published in 2000 and saved your result as ms2000.txt. This file shall now serve as input file for PMComp and PMComp shall generate a compressed output file serving as input for Meva.

(Hint: You will not be asked for File to read [pubmed_result.txt]: if you double clicked the file or dragged it onto the executable, since PMComp already knows the input file name in this case.)

In the 1st example, all fields except the abstracts shall be extracted. This is sensible if you want to perform different analyses with the same PubMed data:

PubMed File Compressor V1.2.0.1.

File to read [pubmed_result.txt]: ms2000.txt
Field 1 [Delete only AB's]:
File to write [ms2000.meva]:

Deleted only AB's.
Compressed 10238 KB of ms2000.txt onto 4576 KB (44%) in ms2000.meva.

Press a key to continue ...

In the 2nd example, only MeSH Terms (MH) and Authors (AU) shall be extracted, the rest is dropped. As you can see, PMComp suggests an output file name with trailing field short names:

PubMed File Compressor V1.2.0.1.

File to read [pubmed_result.txt]: ms2000.txt
Field 1 [Delete only AB's]: MH
Field 2 [NONE]: AU
File to write [ms2000-mh-au.meva]:

Counted 3908 records.
Extracted 41488 'MH's.
Extracted 17267 'AU's.
Compressed 10238 KB of ms2000.txt onto 1586 KB (15%) in ms2000-mh-au.meva.

Press a key to continue ...

In our last example, only countries (CY) shall be extracted. The compression ratio of about 100:1 is superb, but you can now use this file in Meva only to analyze countries. If you want to analyze other fields as well, PMComp has to be run again with different field names:

PubMed File Compressor V1.2.0.1.

File to read [pubmed_result.txt]: ms2000.txt
Field 1 [Delete only AB's]: CY
Field 2 [NONE]:
File to write [ms2000-cy.meva]:

Counted 3908 records.
Extracted 3853 'CY's.
Compressed 10238 KB of ms2000.txt onto 127 KB (1%) in ms2000-cy.meva.

Press a key to continue ...

Command Line Mode Usage

Alternatively you can pass PMComp its parameters directly on the command line (DOS prompt, command shell or whatever you call it). PMComp will ask for any parameters that are not supplied. Entering pmcomp -? on the command line prints a short help screen:

PubMed File Compressor V1.2.0.1, (c) 2002, 2022 med-ai.com.
This tool compresses a PubMed result file downloaded in 'PubMed' format.

Syntax:
  pmcomp [-?] [pubmedfile] [fieldname1] [fieldname2] [outfile]

Examples:
  Extract Authors and MeSH codes: pmcomp pubmed_res.txt AU MH out.txt
  Extract Authors               : pmcomp pubmed_res.txt AU NONE out.txt
  Remove only abstracts         : pmcomp pubmed_res.txt NONE NONE out.txt

With all parameters set on the command line, the program runs quietly;
otherwise the program interactively asks for missing values.
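When several files have to be compressed, the command line can also be assembled by a script. The following Python sketch builds the argument list according to the syntax shown above and derives the default output name from the pattern visible in the interactive examples (ms2000-mh-au.meva). The helper names are hypothetical, and the script assumes that the pmcomp executable can be found on your PATH.

```python
import os
import subprocess
from typing import List, Optional


def suggest_output_name(infile: str, fields: List[str]) -> str:
    """Mimic the suggested name: input stem plus lower-case field names."""
    stem, _ = os.path.splitext(infile)
    suffix = "".join(f"-{f.lower()}" for f in fields if f and f.upper() != "NONE")
    return f"{stem}{suffix}.meva"


def build_pmcomp_command(infile: str, field1: str = "NONE", field2: str = "NONE",
                         outfile: Optional[str] = None,
                         executable: str = "pmcomp") -> List[str]:
    """Return the argument list: pmcomp pubmedfile fieldname1 fieldname2 outfile."""
    if outfile is None:
        outfile = suggest_output_name(infile, [field1, field2])
    return [executable, infile, field1, field2, outfile]


def run_pmcomp(command: List[str]) -> None:
    """Run PMComp; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(command, check=True)


if __name__ == "__main__":
    # Only prints the command; call run_pmcomp(...) to execute it.
    print(" ".join(build_pmcomp_command("ms2000.txt", "MH", "AU")))
```

Passing all four parameters matters here: according to the help screen, PMComp then runs quietly instead of prompting, which is what a script needs.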

Doublet Removal with PMUniq

Description

Joining PubMed result files can produce duplicate articles. Like the Unix tool uniq, PMUniq (PubMed File Unique) finds, displays and removes these doublets.

Installation

Download the archive (Windows x86: pmuniq.zip, 17 KB, MD5) onto your computer. No special setup procedure is needed. Extract the executable from the archive and put it into a directory of your choice, e.g. the directory into which you download your PubMed result files.

Usage

Tell PMUniq the name of the PubMed result file and the desired output file:

  1. Input file name, e.g. in.txt. Make sure this file is in MEDLINE text format.
  2. Output file name, e.g. in-unique.txt. PMUniq suggests a name consisting of the input file name suffixed by -unique.
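The principle of doublet removal can be sketched in Python as follows. This is not PMUniq's implementation. It assumes that records are separated by blank lines, that each record carries its identifier in a PMID- line, and that a doublet is any record whose PMID has already been seen; all function names are hypothetical.

```python
from typing import List, Tuple


def split_records(text: str) -> List[str]:
    """Split MEDLINE text into records, assuming blank lines separate them."""
    records: List[str] = []
    current: List[str] = []
    for line in text.splitlines():
        if line.strip():
            current.append(line)
        elif current:
            records.append("\n".join(current))
            current = []
    if current:
        records.append("\n".join(current))
    return records


def pmid_of(record: str) -> str:
    """Return the PMID of a record, or an empty string if it has none."""
    for line in record.splitlines():
        if line.startswith("PMID-"):
            return line[5:].strip()
    return ""


def remove_doublets(text: str) -> Tuple[str, List[str]]:
    """Return the text without repeated records plus the PMIDs of the doublets."""
    seen = set()
    unique: List[str] = []
    doublets: List[str] = []
    for record in split_records(text):
        pmid = pmid_of(record)
        if pmid and pmid in seen:
            doublets.append(pmid)  # first occurrence is kept, repeats are dropped
            continue
        seen.add(pmid)
        unique.append(record)
    return "\n\n".join(unique) + "\n", doublets
```

Unlike the Unix tool uniq, which only collapses adjacent repeated lines, this sketch also catches doublets that are far apart in the file. Records without a PMID are always kept.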

Interactive Usage

Drag and drop the file to be processed onto the PMUniq executable. PMUniq will start and ask for the remaining parameters: either press the ENTER key to accept the defaults or enter your own values, as the example below shows. PMUniq displays the PMIDs of the duplicate records inside ms.txt, removes the doublets and writes the corrected data into ms-unique.txt:

* PubMed Unique V1.0. Type pmuniq -? for help.

File to read [pubmed_result.txt]: ms.txt
File to write [ms-unique.txt]:

PMID's of duplicate records:
23732945
21258057
Found 2 doublets in 93 records in ms.txt, saved corrected data to ms-unique.txt.

Press a key to continue ...

Command Line Mode Usage

Alternatively you can pass PMUniq its parameters directly on the DOS command line. PMUniq will ask for any parameters that are not supplied. Entering pmuniq -? on the command line prints a short help screen:

* PubMed Unique V1.0. Cop. (c) 2013 www.med-ai.com.

Remove duplicate records from PubMed result files (MEDLINE text format).

Syntax: pmuniq [-?] [pubmedfile] [outfile]
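Since doublets typically arise when result files are joined, both steps can be scripted together. The Python sketch below concatenates several result files and builds the PMUniq argument list according to the syntax above. The helper names are hypothetical, UTF-8 encoding of the result files is an assumption, and the default output name follows the -unique suffix described under Usage.

```python
import os
from typing import List, Optional, Sequence


def join_result_files(paths: Sequence[str], joined_path: str) -> None:
    """Concatenate PubMed result files, separating them by one blank line."""
    with open(joined_path, "w", encoding="utf-8") as out:
        for path in paths:
            with open(path, encoding="utf-8") as src:
                # Normalize trailing newlines so records stay separated.
                out.write(src.read().rstrip("\n") + "\n\n")


def build_pmuniq_command(infile: str, outfile: Optional[str] = None,
                         executable: str = "pmuniq") -> List[str]:
    """Return the argument list: pmuniq pubmedfile outfile."""
    if outfile is None:
        stem, ext = os.path.splitext(infile)
        outfile = f"{stem}-unique{ext}"
    return [executable, infile, outfile]
```

The resulting list can be passed to subprocess.run(command, check=True) once the pmuniq executable is available on your PATH.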